### Testing the Network

Now that all the networks have been successfully trained, with their models and training data saved in the folders models, model_scores and model_infos, we can commence testing.
Each model will be tested on a 50-50 split of male to female testing data, a solely male testing set and a solely female testing set, so that the results can be compared with each other.

First we will load all the models into memory. This is done by putting the file names into a list, initialising a model for each one, and then loading each file into its corresponding model.

In [None]:
import torch

# Saved weights: three training splits per architecture, in the same order as `models` below
model_names = ["convNN2_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt", "convNN2_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt", "convNN2_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt",
        "dense_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt", "dense_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt",
        "dense_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt", "resnet34_ll_training_50-50_split.txt_batch32_ep50_lr0.0001.pt", "resnet34_ll_training_25-75_split.txt_batch32_ep50_lr0.0001.pt",
        "resnet34_ll_training_75-25_split.txt_batch32_ep50_lr0.0001.pt"]

# One freshly initialised network per saved file
models = [convNN2(), convNN2(), convNN2(), denseNN(), denseNN(), denseNN(), resnet34(), resnet34(), resnet34()]

for name, model in zip(model_names, models):
    # Load onto the CPU, since the evaluation below runs on the CPU
    model.load_state_dict(torch.load("./models/" + name, map_location="cpu"))
    model.eval()  # switch dropout / batch norm layers to inference mode

Once we have done this, we can calculate the scores of each model on each test set. For each test set, evaluate computes the mean error and standard deviation and returns them as a tuple.

In [None]:
for model in models:
    print(model.__class__.__name__)
    # evaluate returns a (mean error, std) tuple, so format it instead of concatenating it to a string
    print(f"50-50 test file {evaluate(model, './ll_test_50-50', 'cpu')}")
    print(f"male test file {evaluate(model, './ll_test_males', 'cpu')}")
    print(f"female test file {evaluate(model, './ll_test_females', 'cpu')}")
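The `evaluate` helper is defined elsewhere in the project, so its internals are not shown here. As a rough sketch of the kind of computation described above, assuming the error is the Euclidean pixel distance between predicted and expected landmarks, the mean and standard deviation could be computed as follows. The names `landmark_errors` and `summarise_errors` and the sample coordinates are hypothetical and are not taken from the project code.

```python
import math
import statistics


def landmark_errors(predicted, expected):
    """Euclidean distance in pixels between each predicted and expected (x, y) point."""
    return [math.dist(p, e) for p, e in zip(predicted, expected)]


def summarise_errors(errors):
    """Return (mean, population standard deviation) of a list of distances."""
    return statistics.fmean(errors), statistics.pstdev(errors)


# Illustrative values only, not results from the models above.
predicted = [(100.0, 180.0), (103.0, 104.0), (80.0, 75.0)]
expected = [(100.0, 175.0), (100.0, 100.0), (80.0, 75.0)]

errors = landmark_errors(predicted, expected)
mean_error, std_error = summarise_errors(errors)
print(f"mean error = {mean_error:.3f}, std = {std_error:.3f}")
```

Whether the real `evaluate` uses the population or the sample standard deviation is not stated in this notebook, so that choice is also an assumption.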

The error for convNN2 is much higher than for the other two models, which is due to the simplicity of the model.
ResNet34 also outperformed DenseNet121, with a significant reduction in error between the two.
Both of these models had a higher error when identifying landmarks on females than on males, as we expected when we started this project.

Both models performed best overall on balanced data sets, though this difference was very slight.
The results also show that a higher proportion of females or males in the dataset does not necessarily result in a lower error for each respective gender. The change in error was quite minimal,
meaning we can discard the bias in the dataset as a source of the bias between males and females.




Now we can generate graphs of how the mean error and standard deviation changed over the iterations during training.

The extra data saved during training is loaded in for each model and plotted on two separate graphs.
The model scores file needs to be parsed to make it useful to us: each line is split into a list using ',' as the divider.
The model infos, on the other hand, are already stored in a JSON file and can simply be read in. This gives us the iteration, mean and std of the saved model.

In [None]:


import json
import matplotlib.pyplot as plt

for name in model_names:
    epoch = []
    error = []
    std = []
    # Strip the ".pt" extension to get the base name shared by the scores and info files
    m_nopt = name.split(".pt")
    with open("./model_scores/" + m_nopt[0] + ".csv") as file:
        for line in file:
            scores = line.split(",")
            epoch.append(float(scores[0]))
            error.append(float(scores[1]))
            std.append(float(scores[2]))

    # Iteration, mean error and std of the saved model
    with open("./model_infos/" + m_nopt[0] + ".json") as file:
        model_info = json.load(file)

    plt.plot(epoch, error)
    plt.xlabel("Iterations")
    plt.ylabel("Mean Error")
    plt.title('Error change during training')

    plt.scatter([model_info["iteration"]], [model_info["mean"]], color = 'red')
    plt.gca().legend(('Validation Error','Best performing iteration'))
    plt.show()

    plt.plot(epoch, std)
    plt.xlabel("Iterations")
    plt.ylabel("Standard Deviation")
    plt.title('STD change during training')
    plt.scatter([model_info["iteration"]], [model_info["std"]], color = 'red')
    plt.gca().legend(('Validation STD','Best performing iteration'))
    plt.show()
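The parsing step described above can be tried in isolation without the saved files. The sketch below assumes each line of a scores file holds `iteration,mean error,std` in that order (as the cell above reads them) and that the info file has the keys `iteration`, `mean` and `std`. The helper name `parse_scores` and the sample values are made up for illustration.

```python
import io
import json


def parse_scores(lines):
    """Split each 'iteration,error,std' line on commas into three float lists."""
    epochs, errors, stds = [], [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. a trailing newline
        epoch, error, std = line.split(",")[:3]
        epochs.append(float(epoch))
        errors.append(float(error))
        stds.append(float(std))
    return epochs, errors, stds


# Made-up contents standing in for a model_scores CSV and a model_infos JSON file.
scores_file = io.StringIO("0,60.0,12.0\n1,20.5,8.0\n2,22.0,7.5\n")
info_text = '{"iteration": 1, "mean": 20.5, "std": 8.0}'

epochs, errors, stds = parse_scores(scores_file)
model_info = json.loads(info_text)

# If the saved model is the one with the lowest error, it should sit on the curve's minimum.
best_index = errors.index(min(errors))
print(epochs[best_index], model_info["iteration"], model_info["mean"])
```

A check like this is a cheap way to confirm that the red "best performing iteration" marker and the plotted curve come from the same training run.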
    

From the graphs we can see that both the error and the std decrease rapidly within the first few iterations, as the untrained weights initially produce predictions that are far away from the correct points (chin, nose and left eye).
Both ConvNN2 and DenseNet seem to oscillate between certain points and do not converge to a single point like ResNet34 does, which seems to indicate that they have not reached the optimal model.

We can also visually check how close the predicted landmarks are on an image by comparing them against the expected landmarks that the model was trained to predict.

In [None]:
import torch
import matplotlib.pyplot as plt
from torch.utils.data import DataLoader

def generate_images(model, dataloader, axs_flat):
    # Draw each image with the predicted (cyan) and expected (magenta) landmarks
    with torch.no_grad():
        for i, (image, _, _, _, labels) in enumerate(dataloader):
            if i >= len(axs_flat):
                break  # no subplots left to draw on
            output = model(image)
            output = output.reshape(3,2)
            image = image.squeeze()
            image = image.permute(1, 2, 0)    # Default was 3,200,200
            axs_flat[i].imshow(image)

            # Select the chin, nose and eye landmarks for each image
            land_idx = [8, 30, 39]
            labels = labels.squeeze()
            labels = labels[land_idx, :]
            axs_flat[i].scatter(output[:,0], output[:,1], linewidth=2, color='c', s = 5)
            axs_flat[i].scatter(labels[:,0], labels[:,1], linewidth=2, color='m', s = 5)

UTKFace = CustomImageDataset("./test_output_images", 'UTKFace')
test_dataloader = DataLoader(UTKFace,
                             batch_size=1,
                             shuffle=False)

for model in models:
    print(model.__class__.__name__)
    fig, axs = plt.subplots(2,5, figsize=(20,10))
    axs_flat = axs.flatten()

    generate_images(model, test_dataloader, axs_flat)
    fig.legend(('Predicted output','Expected output'))
    plt.show()
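The two indexing steps in the cell above (reshaping the six network outputs into three points, and picking rows 8, 30 and 39 out of the full annotation) can be illustrated with plain lists. This is only a sketch: it assumes the outputs are ordered as x, y pairs and that each image has 68 annotated points. The helper names and the dummy data are hypothetical.

```python
LANDMARK_INDICES = [8, 30, 39]  # chin, nose, eye (as in the cell above)


def to_points(flat_output):
    """Group a flat [x0, y0, x1, y1, ...] list into (x, y) pairs."""
    return [(flat_output[i], flat_output[i + 1]) for i in range(0, len(flat_output), 2)]


def select_landmarks(all_points, indices=LANDMARK_INDICES):
    """Keep only the annotated points at the given indices."""
    return [all_points[i] for i in indices]


# Placeholder data: 68 dummy annotated points and a dummy 6-value model output.
all_points = [(float(i), float(2 * i)) for i in range(68)]
flat_output = [9.0, 15.0, 31.0, 61.0, 39.0, 78.0]

predicted = to_points(flat_output)
expected = select_landmarks(all_points)
for name, p, e in zip(("chin", "nose", "eye"), predicted, expected):
    print(f"{name}: predicted {p}, expected {e}")
```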


#### Analysis

As we can see, the basic convolutional neural network essentially predicts the same 3 points for every single image, since that gives the best average mean error for the model
when it is underfitting. It does not have enough layers, depth or complexity to compute, recognise and plot the landmarks on the different features of the face.

Currently the model produces a very high mean error of 54.377 for females and 54.895 for males, and in a 200x200 image this is very far off.
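To put those numbers in context, the short calculation below expresses each mean error as a fraction of the image size. It assumes the errors are measured in pixels, which this notebook does not state explicitly.

```python
import math

IMAGE_SIZE = 200  # images are 200x200 pixels

# Mean errors reported above for convNN2 (assumed to be in pixels).
mean_errors = {"female": 54.377, "male": 54.895}

# Longest possible distance between two points in the image.
diagonal = math.hypot(IMAGE_SIZE, IMAGE_SIZE)

for group, error in mean_errors.items():
    pct_width = 100 * error / IMAGE_SIZE
    pct_diagonal = 100 * error / diagonal
    print(f"{group}: {pct_width:.1f}% of image width, {pct_diagonal:.1f}% of diagonal")
```

Under that assumption, both errors come to a little over a quarter of the image width, which is consistent with the model placing roughly the same three points on every face.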


There is not much that can be done to improve on this. The error may reduce slightly as more layers are added, but since it is a basic model the error would start increasing again soon after adding more layers
due to the vanishing gradient problem and potential overfitting of the model. During backpropagation, the gradient for the weights of the early layers is a product of the derivatives of all the later layers. As more layers are added,
this product shrinks towards zero, so the partial derivative of the loss function with respect to those weights effectively disappears and they stop updating.

One of the main reasons ResNet and DenseNet were developed was to attempt to solve this vanishing gradient issue in neural networks, and as such they will be more appropriate in this example.
Thus, convNN2 will not be considered when evaluating our hypothesis.
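The vanishing gradient effect mentioned above can be illustrated with a toy calculation. The sketch below assumes a chain of scalar layers with sigmoid-style activations (whose derivative is at most 0.25) and weights of 1. The activations actually used in convNN2 are not shown in this notebook, so this is a simplification rather than a description of the model. The `skip=True` case models a layer that computes x + f(x), as in a residual connection, whose derivative is 1 + f'(x).

```python
# Assumption: sigmoid-style activations, whose derivative is at most 0.25.
MAX_SIGMOID_GRAD = 0.25


def gradient_scale(num_layers, per_layer_grad, skip=False):
    """Product of per-layer derivative factors along a chain of layers.

    With skip=True each layer computes x + f(x), so its factor is 1 + f'(x).
    """
    factor = 1.0 + per_layer_grad if skip else per_layer_grad
    return factor ** num_layers


for depth in (2, 5, 10, 20):
    plain = gradient_scale(depth, MAX_SIGMOID_GRAD)
    residual = gradient_scale(depth, MAX_SIGMOID_GRAD, skip=True)
    print(f"{depth:2d} layers: plain {plain:.2e}, with skip connections {residual:.2e}")
```

In this toy setting the plain product shrinks geometrically with depth, while the identity term in the skip-connection case keeps the factor from falling below 1.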

This basic neural net was useful in testing the limits of what a simple neural network can calculate when performing a more specialised task, and it forced us to explore other
neural network models that were more useful. Additionally, it gave us an insight into the amount of work and research that has gone into various nets and into trying to improve them.




Both ResNet34 and DenseNet121 vastly outperformed convNN2 and achieved a significantly lower average mean error and standard deviation.
Their errors on the different datasets can be seen below.
##### ResNet34 Test Data
![ResNet34 Test Data](jupyter_images/resnet34table.jpg) 
##### DenseNet121 Test Data
![DenseNet121 Test Data](jupyter_images/densenet121table.jpg)

From the tables we can see that ResNet had an error about 45% lower than DenseNet, and was a lot more sensitive to bias in the data sets.
Regardless of data set bias, DenseNet has a higher mean error for women than for men, while for ResNet this flipped for datasets that used a larger subset of women than men.

If we look at a graph of the change in mean error for both models:
![Both Models Mean Error Change](jupyter_images/modelgraphs.jpg)

- Not uniquely a data issue: there are issues fitting the model, and the discrepancy between males and females appears across all training datasets regardless of bias.
- DenseNet is normally used for image classification; this is an issue in a commonly used neural network, not the root cause.
- DenseNet is underfitting: it is not sensitive enough to facial structure and subtlety for women, which points to vanishing gradients (the model is not complex enough).
- A more complicated and sensitive model is needed.
- ResNet fit well but was more affected by dataset bias, and was otherwise fairly consistent.
- ResNet's skip connections cannot travel right from start to end like DenseNet's dense blocks.


#### Future Work
- The main focus was the difference in accuracy between males and females.
  - Hyperparameters were not fine-tuned and can be improved in future (the focus was the difference in bias, rather than relative performance).
- Adding more layers to DenseNet.
- Reducing layers.
- Trying out different pooling algorithms.