### Testing the Network

Now that all the networks have been trained, with their models and training data saved in the folders models, model_scores and model_infos, we can commence testing.
Each model will be tested on three test sets so the results can be compared: a 50-50 split of male and female data, a solely male set and a solely female set.

First we load all the models into memory. This is done by listing the saved file names, initialising a model of the matching architecture for each one, and then loading each file into its corresponding model.

In [None]:
# File names of the saved models: three training splits per architecture,
# listed in the same order as the model instances below
model_names = ["convNN2_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt", "convNN2_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt", "convNN2_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt",
        "dense_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt", "dense_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt", "dense_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt",
        "resnet34_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt", "resnet34_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt", "resnet34_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt"]

models = [convNN2(), convNN2(), convNN2(), denseNN(), denseNN(), denseNN(), resnet34(), resnet34(), resnet34()]

for name, model in zip(model_names, models):
    # Load the saved weights onto the CPU, which is where the models are evaluated
    model.load_state_dict(torch.load("./models/" + name, map_location="cpu"))
    # Switch to evaluation mode so layers such as batch norm behave deterministically
    model.eval()
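
Because the two lists above have to stay in the same order, it can help to read the architecture and settings back out of each file name. The following is only a sketch: the helper `parse_model_name` is hypothetical and assumes every file follows the naming pattern seen in the list above.

```python
import re

# Assumed pattern: <arch>_ll_training_<split>_split.txt_batch<N>_ep<N>_lr<float>.pt
NAME_PATTERN = re.compile(
    r"^(?P<arch>[^_]+)_ll_training_(?P<split>\d+-\d+)_split\.txt"
    r"_batch(?P<batch>\d+)_ep(?P<epochs>\d+)_lr(?P<lr>[\d.]+)\.pt$"
)

def parse_model_name(name):
    """Split a saved model file name into its architecture and training settings."""
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError("Unexpected model file name: " + name)
    return {
        "arch": match.group("arch"),
        "split": match.group("split"),
        "batch": int(match.group("batch")),
        "epochs": int(match.group("epochs")),
        "lr": float(match.group("lr")),
    }

info = parse_model_name("resnet34_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt")
print(info["arch"], info["split"])
```

The `arch` field could then be used to choose which model class to initialise, so a duplicated or missing file name cannot silently pair a file with the wrong architecture.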

Once this is done we can calculate the scores of each model on each test set. For each test set, evaluate computes the mean error and standard deviation and returns them as a tuple.

In [None]:
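for model in models:
    print(model.__class__.__name__)
    # evaluate returns a (mean error, std) tuple, so pass it to print as its own
    # argument instead of concatenating it to a string (which raises a TypeError)
    print("50-50 test file", evaluate(model, "./ll_test_50-50", "cpu"))
    print("male test file", evaluate(model, "./ll_test_males", "cpu"))
    print("female test file", evaluate(model, "./ll_test_females", "cpu"))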
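
The body of evaluate is not shown in this section. As a rough illustration of the kind of tuple it returns, the sketch below computes the mean and standard deviation of the Euclidean distance between predicted and expected landmarks. The function name `landmark_error_stats`, the use of Euclidean distance and the population standard deviation are all assumptions, and the coordinates are made up for the example.

```python
import math
import statistics

def landmark_error_stats(predicted, expected):
    """Return (mean, std) of the Euclidean distance between matching landmarks."""
    distances = [math.dist(p, e) for p, e in zip(predicted, expected)]
    # Population standard deviation is an assumption; evaluate may use another form
    return statistics.mean(distances), statistics.pstdev(distances)

# Illustrative (x, y) pixel coordinates for three landmarks
predicted = [(100.0, 150.0), (103.0, 104.0), (80.0, 70.0)]
expected = [(100.0, 150.0), (100.0, 100.0), (80.0, 75.0)]

mean_error, std_error = landmark_error_stats(predicted, expected)
print(mean_error, std_error)
```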

The error for convNN2 is much higher than for the other two models, which is due to the simplicity of the model.
ResNet34 outperformed DenseNet121, with a significant reduction in error between the two.
Both of these models generally had a higher error when identifying landmarks on females, as we expected when starting this project.

Both models performed best overall on balanced data sets, though this difference was very slight.
The results also show that a higher proportion of females or males in the data set does not necessarily result in a lower error for the respective gender. The change in error was quite small,
meaning we can discard bias in the data set as the main source of the disparity between males and females.




Now we can generate graphs showing how the mean error and standard deviation changed over the iterations during training.

All the extra data saved during training is loaded in alongside each model and plotted on two separate graphs.
The model scores file needs to be parsed to make it useful to us: each line is split into a list using ',' as the divider.
The model infos are already stored in a json file and can be read in directly. This gives us the iteration, mean and std of the saved model.

# TODO: Add Loss for each iteration

In [None]:


import json
import matplotlib.pyplot as plt

for name in model_names:
    print(name)
    epoch = []
    error = []
    std = []
    # Strip the ".pt" extension to get the base name shared by the scores and info files
    base_name = name.rsplit(".pt", 1)[0]
    with open("./model_scores/" + base_name + ".csv") as file:
        for line in file:
            # Each line holds: iteration, mean error, standard deviation
            scores = line.split(",")
            epoch.append(float(scores[0]))
            error.append(float(scores[1]))
            std.append(float(scores[2]))

    # Iteration, mean and std of the saved model
    with open("./model_infos/" + base_name + ".json") as file:
        model_info = json.load(file)

    # Mean error over training, with the saved model marked in red
    plt.plot(epoch, error)
    plt.xlabel("Iterations")
    plt.ylabel("Mean Error")
    plt.title('Error change during training')
    plt.scatter([model_info["iteration"]], [model_info["mean"]], color = 'red')
    plt.gca().legend(('Validation Error','Best performing iteration'))
    plt.show()

    # Standard deviation over training, with the saved model marked in red
    plt.plot(epoch, std)
    plt.xlabel("Iterations")
    plt.ylabel("Standard Deviation")
    plt.title('STD change during training')
    plt.scatter([model_info["iteration"]], [model_info["std"]], color = 'red')
    plt.gca().legend(('Validation STD','Best performing iteration'))
    plt.show()
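
As a sanity check on the red marker, the best iteration can also be recovered from the scores data itself. This is a minimal sketch under two assumptions: that each row is laid out as iteration, mean error, std (as in the parsing above), and that the saved model is the one with the lowest mean error. The rows in `scores_text` are invented for illustration and are not results from our training runs.

```python
import csv
import io

# Hypothetical contents of a model_scores file: iteration, mean error, std
scores_text = "0,60.0,12.0\n1,20.0,8.0\n2,9.5,4.0\n3,10.2,4.5\n"

def read_scores(handle):
    """Parse comma-separated score rows into (iteration, mean, std) tuples."""
    rows = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        rows.append((float(row[0]), float(row[1]), float(row[2])))
    return rows

scores = read_scores(io.StringIO(scores_text))
# Assumption: "best performing" means the lowest mean error
best_iteration, best_mean, best_std = min(scores, key=lambda row: row[1])
print(best_iteration, best_mean, best_std)
```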
    

From the graphs we can see that both the error and the std decrease rapidly within the first few iterations, as the untrained weights initially produce predictions that are far away from the correct points (chin, nose and left eye).
Both ConvNN2 and DenseNet seem to oscillate between certain values and do not converge to a single value like ResNet34 does, which seems to indicate that they have not reached the most optimal model.
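
One way to put a number on "oscillating" versus "converging" is to measure how much the error still varies over the last few recorded iterations. The helper `tail_spread` below is a hypothetical sketch, and the two curves are made-up shapes rather than our recorded errors.

```python
import statistics

def tail_spread(errors, tail=5):
    """Standard deviation of the last `tail` values of an error curve."""
    return statistics.pstdev(errors[-tail:])

# Illustrative curves only: one that settles and one that keeps bouncing
converging = [50.0, 20.0, 10.0, 6.0, 5.0, 5.0, 5.0, 5.0, 5.0]
oscillating = [50.0, 20.0, 10.0, 14.0, 8.0, 14.0, 8.0, 14.0, 8.0]

print(tail_spread(converging), tail_spread(oscillating))
```

A spread close to zero suggests the curve has settled, while a large spread matches the oscillating behaviour described above.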

We can also visually check how close the predicted landmarks are on an image by comparing them against the labelled landmarks.

In [None]:
def generate_images(model, dataloader, axs_flat):
    # Plot predicted (cyan) and expected (magenta) landmarks on each test image
    with torch.no_grad():
        for i, (image, _, _, _, labels) in enumerate(dataloader):
            # Stop once every subplot has been filled
            if i >= len(axs_flat):
                break
            output = model(image)
            output = output.reshape(3,2)
            image = image.squeeze()
            image = image.permute(1, 2, 0)    # Default was 3,200,200
            axs_flat[i].imshow(image)

            # Select the chin, nose and eye landmarks from the labels of each image
            land_idx = [8, 30, 39]
            labels = labels.squeeze()
            labels = labels[land_idx, :]
            axs_flat[i].scatter(output[:,0], output[:,1], linewidth=2, color='c', s = 5)
            axs_flat[i].scatter(labels[:,0], labels[:,1], linewidth=2, color='m', s = 5)

UTKFace = CustomImageDataset("./test_output_images", 'UTKFace')
test_dataloader = DataLoader(UTKFace,
                             batch_size=1,
                             shuffle=False)

for model in models:
    print(model.__class__.__name__)
    fig, axs = plt.subplots(2,5, figsize=(20,10))
    axs_flat = axs.flatten()

    # Pass the model explicitly rather than relying on the loop's global variable
    generate_images(model, test_dataloader, axs_flat)
    fig.legend(('Predicted output','Expected output'))
    plt.show()

#### Analysis

As we can see, the basic convolutional neural network essentially predicts the same three points for every single image. Because the model is underfitting, this gives it the best average mean error it can achieve:
it does not have enough layers, depth or complexity to recognise the different features of the face and plot the landmarks on them.

Currently the model produces a very high mean error of 54.377 for females and 54.895 for males, which in a 200x200 image is very far off. Thus, convNN2 will not be considered when evaluating our hypothesis.
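
To put these numbers in perspective, the short calculation below expresses each error as a fraction of the image width and of the image diagonal. It assumes the mean error is measured in pixels, which this section does not state explicitly.

```python
import math

IMAGE_SIZE = 200  # images are 200x200
mean_errors = {"female": 54.377, "male": 54.895}  # convNN2 errors quoted above

# Longest possible distance between two points in the image
diagonal = math.hypot(IMAGE_SIZE, IMAGE_SIZE)

relative = {
    group: {"of_width": error / IMAGE_SIZE, "of_diagonal": error / diagonal}
    for group, error in mean_errors.items()
}

for group, fractions in relative.items():
    print(group, round(fractions["of_width"], 3), round(fractions["of_diagonal"], 3))
```

Under that assumption, the average prediction is off by roughly 27% of the image width, or about 19% of the diagonal.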




Both ResNet34 and DenseNet121 vastly outperformed convNN2, achieving a significantly lower mean error and standard deviation.
Their errors on the different data sets can be seen below.
##### ResNet34 Test Data
![ResNet34 Test Data](jupyter_images/resnet34table.jpg) 
##### DenseNet121 Test Data
![DenseNet121 Test Data](jupyter_images/densenet121table.jpg)

From the tables we can see that ResNet had an error about 45% lower than DenseNet and was a lot more sensitive to bias in the data sets.
DenseNet had a higher error for women regardless of data set bias. Its error on the female test set was also very insensitive to data set bias, indicating that the model is not properly learning to predict the correct landmarks for women.


We can look at a graph of the mean error for both models to get a better understanding of what is happening during training.

![Both Models Mean Error Change](jupyter_images/modelgraphs.jpg)

ResNet's mean error converges to an asymptote while DenseNet continues oscillating. This indicates that DenseNet had some issue during training and did not output a very optimal model. High oscillation tends to indicate that the learning rate is too high or that the training might benefit from using an optimiser with momentum.

DenseNet is ostensibly the more sophisticated model as it has considerably more layers, but this did not appear to hold in practice. It appeared the model was unable to properly fit to the female portion of the data set, while ResNet, which is comparatively simple, did. We theorise this is due to DenseNet having vanishing gradient issues caused by its high layer count. While DenseNet is intended to prevent vanishing gradients, the PyTorch implementation is built from dense blocks. The layers inside these blocks possess dense connections, but subsequent layers only receive inputs from the previous dense block's output layer. This saves on computational complexity by bringing down the number of weights. However, it could allow vanishing gradients to exist.

The ResNet model, however, used fewer layers and took advantage of skip connections. Since it uses summation as opposed to concatenation to bypass layers, it is more computationally efficient. Due to this, the model does not need to be split into blocks and the outputs from earlier layers can reach all later layers. This theory is consistent with our concerns about vanishing gradients. Additionally, when we examined the model weights of our trained DenseNets, we found that many of the earlier weight parameters had changed by no more than $10^{-4}$, further supporting the vanishing gradient explanation.
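
The weight comparison mentioned above can be sketched as follows. This is a simplified illustration rather than the code we ran: plain dictionaries of flattened weights stand in for the initial and trained model state, and the parameter names, values and helper functions are all hypothetical.

```python
def max_abs_change(before, after):
    """Largest absolute per-weight change for each named parameter."""
    return {
        name: max(abs(a - b) for a, b in zip(after[name], before[name]))
        for name in before
    }

def barely_updated(before, after, threshold=1e-4):
    """Names of parameters whose weights all moved by no more than `threshold`."""
    changes = max_abs_change(before, after)
    return [name for name, change in changes.items() if change <= threshold]

# Hypothetical flattened weights before and after training
initial = {"early.conv": [0.10, -0.20, 0.30], "late.fc": [0.05, 0.40, -0.10]}
trained = {"early.conv": [0.10003, -0.20001, 0.30002], "late.fc": [0.25, 0.10, -0.35]}

stalled = barely_updated(initial, trained)
print(stalled)
```

Parameters that show up in `stalled` for the early layers but not the later ones would be consistent with gradients shrinking before they reach the start of the network.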


Another issue may be underfitting of the model, which could be investigated in the future by increasing the number of layers in the network. If this causes an increase or no change in mean error, it would indicate that vanishing gradients are at play and that the number of layers must be reduced instead.

Differing amounts and types of pooling can also be experimented with in the future, as ResNet uses max pooling while DenseNet uses average pooling. Average pooling could reduce the fidelity and visibility of the contours and features of the face and, as seen in the image output, our model performed better on sharper, high resolution images with higher contrast. Max pooling tends to sharpen an image and increase its contrast, making it easier for our model to identify landmarks, while average pooling smooths out an image and reduces its contrast.
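
The difference between the two pooling types can be seen on a tiny example. The sketch below applies 2x2 pooling with a stride of 2 to a hand-made 4x4 patch. It is written in plain Python for illustration and is not how either network implements pooling internally.

```python
def pool2x2(image, reduce):
    """Apply non-overlapping 2x2 pooling using the given reduction function."""
    rows, cols = len(image), len(image[0])
    return [
        [
            reduce([image[r][c], image[r][c + 1], image[r + 1][c], image[r + 1][c + 1]])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]

def mean(values):
    return sum(values) / len(values)

# Illustrative patch: one bright isolated pixel and one uniformly lit corner
image = [
    [0, 0, 0, 0],
    [0, 8, 0, 0],
    [0, 0, 4, 4],
    [0, 0, 4, 4],
]

max_pooled = pool2x2(image, max)
avg_pooled = pool2x2(image, mean)
print(max_pooled)
print(avg_pooled)
```

Max pooling keeps the isolated bright pixel at its full value of 8, while average pooling dilutes it to 2, which is the loss of contrast described above.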


Overall, our testing and experiments indicate, as suggested by the literature, that bias in the data set is not necessarily the root cause of the discrepancies found between men and women in image classification tasks.
In the data, a change in bias causes only slight changes to the mean error for both men and women. For a less sensitive model like DenseNet, the mean error for females is always higher. With ResNet, using a female biased data set did cause the mean error for females to be lower than for males, though this error was still higher than the mean error for males when using a male biased data set, suggesting that neural network models do struggle more with females than males, as expected.
As stated in the literature, the issue lies in the differing facial structure and subtlety of facial features for women, which are often quite difficult for neural networks to identify.
DenseNet is a model classically used for image classification, yet it had certain issues here. This potentially indicates that the model is failing on the more intricate features of female faces, possibly due to vanishing gradients or underfitting during training.

This project may have highlighted the three chosen landmarks (chin, nose and eye) as features that can be located on both females and males with a low disparity in error between genders, and which could potentially be utilised in future image classification and facial recognition software and projects. Further testing is required, however, as this project only looks at a small section of the supplied landmarks and not the whole face or various sections of it. Once more landmarks are included, the small error from our models may increase substantially and the models may no longer be feasible to use. More landmarks would also require more complicated models with more finely tuned features to minimise loss and be sufficient to test our hypothesis.

Since we've only tested a small subset of features from the data set, it's unclear whether neural nets have a harder time pinpointing all landmarks on women as opposed to men, or if there are just a few landmarks that cause issues. For example, does the inaccuracy come from the models struggling to find a woman's jaw because it's less defined, or do they perform worse on all features?
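
One way to start answering this question would be to report the error for each landmark separately instead of a single mean. The sketch below shows the idea using Euclidean distance. The helper `per_landmark_mean_error` is hypothetical and the coordinates are made up for illustration, not taken from our results.

```python
import math

LANDMARK_NAMES = ("chin", "nose", "left eye")

def per_landmark_mean_error(predictions, labels):
    """Mean Euclidean error of each landmark, averaged over all images."""
    totals = [0.0] * len(LANDMARK_NAMES)
    for predicted, expected in zip(predictions, labels):
        for i, (p, e) in enumerate(zip(predicted, expected)):
            totals[i] += math.dist(p, e)
    return {
        name: total / len(predictions)
        for name, total in zip(LANDMARK_NAMES, totals)
    }

# Illustrative (x, y) coordinates for two images, in chin, nose, left eye order
predictions = [
    [(100.0, 180.0), (100.0, 110.0), (70.0, 80.0)],
    [(96.0, 172.0), (100.0, 108.0), (72.0, 80.0)],
]
labels = [
    [(100.0, 170.0), (100.0, 110.0), (70.0, 78.0)],
    [(102.0, 180.0), (100.0, 110.0), (72.0, 80.0)],
]

errors = per_landmark_mean_error(predictions, labels)
print(errors)
```

Running this separately on the male and female test sets would show whether the gap between genders is concentrated in one landmark, such as the chin, or spread across all of them.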


#### Future Work
Some future work that may be performed is to eliminate other facial features from the list of potential causes for unfair results between males and females. 

Investigate if the unfair accuracy stems from a Neural Network’s lack of capacity to handle an excessive number of facial features.

Investigate the minimal number of facial features necessary to perform face recognition fairly.

Investigate to what extent data can be biased but still achieve face recognition fairly on the identified ideal facial features.

Investigate if fairness is still achieved in our results for individual races.

Verify the vanishing gradient issue by experimenting with smaller models.