In [6]:
# %%html
# <!-- This cell makes the font bigger to make it easy to read. Adjust to taste -->
# <style>
# .cell, .CodeMirror pre{ 
#     font-size: 150%;
#     line-height: 125%;
# }
# </style>

# COSC470 Assignment 2, 2018

## Name: Robbie Cook
## Due Date: Monday September 24th

For assignment 2 you need to implement machine learning algorithm(s) to label faces according to:
- sex (male/female)
- age (child/teen/adult/senior)
- expression (smiling/serious)

A data set from MIT is made available, along with code to read the images and labels into `numpy` arrays. 
These arrays are divided into training, validation, and testing data sets.

You may use any machine learning algorithms you like to classify the faces.
Techniques you may find useful that we've looked at include:
- Decision trees and random forests
- Boosting (and AdaBoost in particular)
- Support Vector Machines (SVMs)
- Face detection (to focus on the key parts of the image)
- EigenFaces (for dimensionality reduction)
- Neural networks in TensorFlow
- CNNs in TensorFlow

## Submission Requirements

You should submit a version of this Notebook renamed to `YourName.ipynb`, so my submission would be `StevenMills.ipynb`. 
You can assume that the same libraries that are available in the COSC470 Anaconda environment on the lab machines are available.
In particular, you can use numpy, scipy, OpenCV, and TensorFlow.

I should be able to open your Notebook and run it. The Notebook should contain the code to construct and train your classifier(s) from the training data (using the validation data appropriately) and then to compute the labels of the training data through a call to `computeLabels`, which has a stub implementation at the end of this notebook. 

## Marking Scheme

A rough marking scheme is given below. This is intentionally fairly open, so that I can give you marks for doing good stuff without having to predetermine what stuff is good.

- 10 marks for the discussion of choice of algorithms and training strategy
- 10 marks for the explanation and clear implementation
- 5 marks for performance

### Algorithm Choice and Training

I will be looking for a description of the algorithm(s) chosen, why you chose that approach, and how you developed, trained and evaluated your method.
You should think about issues such as how to best make use of the training and validation data and how to select parameters for your chosen method.

You are not restricted to a single classifier or method. If you find it useful to determine age labels first and then use that to help determine expression, then that is fine. If you want to use an SVM for sex classification, but a boosted classifier for age, that's also fine.
However, you should discuss why you chose to use the methods you have chosen.

### Explanation and Clear Implementation

You should implement your chosen algorithm(s) using the training and validation data sets provided. 
Jupyter notebooks let you interleave discussion and code, so you should clearly describe how your implementation works.
You can include mathematics if needed using \\(\LaTeX\\)-style markup as demonstrated in the lecture notebooks.
I'll be looking for clear implementations that illustrate good practice in training and evaluation.

It is expected that you will make appropriate use of libraries such as OpenCV and TensorFlow where appropriate, but your explanation should your understanding of these tools clear. 
For example, if you choose to use a convolutional network, you should explain your architecture, how it relates to the code, and give some justification for the various parameters that you need to select when making a CNN.

### Performance

The last cell of the notebook has a function that takes a face data set and produces labels as a result.
You should modify this so that it uses your machine learning algorithms to generate the labels.
I will then use these labels to compare your results to the ground truth.
I may also shuffle the training, validation, and testing data sets around before running your code.

# The Data Set

The following code reads the data into training, testing, and validation sets.
It assumes that the `.zip` of labelled face data set from the course website has been unzipped into the same directory as the notebook.
There are 1997 training images, and 998 each test and training images.

In [12]:
import numpy as np

# Read in training data and labels

# Some useful parsing functions

# male/female -> 0/1
def parseSexLabel(string):
    if (string.startswith('male')):
        return 0
    if (string.startswith('female')):
        return 1
    print("ERROR parsing sex from " + string)

# child/teen/adult/senior -> 0/1/2/3
def parseAgeLabel(string):
    if (string.startswith('child')):
        return 0
    if (string.startswith('teen')):
        return 1
    if (string.startswith('adult')):
        return 2
    if (string.startswith('senior')):
        return 3
    print("ERROR parsing age from " + string)

# serious/smiling -> 0/1
def parseExpLabel(string):
    if (string.startswith('serious')):
        return 0
    if (string.startswith('smiling') or string.startswith('funny')):
        return 1
    print("ERROR parsing expression from " + string)

# Count number of training instances

numTraining = 0

for line in open ("MITFaces/faceDR"):
    numTraining += 1

dimensions = 128*128

trainingFaces = np.zeros([numTraining,dimensions])
trainingSexLabels = np.zeros(numTraining) # Sex - 0 = male; 1 = female
trainingAgeLabels = np.zeros(numTraining) # Age - 0 = child; 1 = teen; 2 = male 
trainingExpLabels = np.zeros(numTraining) # Expression - 0 = serious; 1 = smiling

index = 0
for line in open ("MITFaces/faceDR"):
    # Parse the label data
    parts = line.split()
    trainingSexLabels[index] = parseSexLabel(parts[2])
    trainingAgeLabels[index] = parseAgeLabel(parts[4])
    trainingExpLabels[index] = parseExpLabel(parts[8])
    # Read in the face
    fileName = "MITFaces/rawdata/" + parts[0]
    fileIn = open(fileName, 'rb')
    trainingFaces[index,:] = np.fromfile(fileIn, dtype=np.uint8,count=dimensions)/255.0
    fileIn.close()
    # And move along
    index += 1

# Count number of validation/testing instances

numValidation = 0
numTesting = 0

# Assume they're all Validation
for line in open ("MITFaces/faceDS"):
    numValidation += 1

# And make half of them testing
numTesting = int(numValidation/2)
numValidation -= numTesting

validationFaces = np.zeros([numValidation,dimensions])
validationSexLabels = np.zeros(numValidation) # Sex - 0 = male; 1 = female
validationAgeLabels = np.zeros(numValidation) # Age - 0 = child; 1 = teen; 2 = male 
validationExpLabels = np.zeros(numValidation) # Expression - 0 = serious; 1 = smiling

testingFaces = np.zeros([numTesting,dimensions])
testingSexLabels = np.zeros(numTesting) # Sex - 0 = male; 1 = female
testingAgeLabels = np.zeros(numTesting) # Age - 0 = child; 1 = teen; 2 = male 
testingExpLabels = np.zeros(numTesting) # Expression - 0 = serious; 1 = smiling

index = 0
for line in open ("MITFaces/faceDS"):
    # Parse the label data
    if (index < numTesting):
        testingSexLabels[index] = parseSexLabel(parts[2])
        testingAgeLabels[index] = parseAgeLabel(parts[4])
        testingExpLabels[index] = parseExpLabel(parts[8])
        # Read in the face
        fileName = "MITFaces/rawdata/" + parts[0]
        fileIn = open(fileName, 'rb')
        testingFaces[index,:] = np.fromfile(fileIn, dtype=np.uint8,count=dimensions)/255.0
        fileIn.close()
    else:
        vIndex = index - numTesting
        validationSexLabels[vIndex] = parseSexLabel(parts[2])
        validationAgeLabels[vIndex] = parseAgeLabel(parts[4])
        validationExpLabels[vIndex] = parseExpLabel(parts[8])
        # Read in the face
        fileName = "MITFaces/rawdata/" + parts[0]
        fileIn = open(fileName, 'rb')
        validationFaces[vIndex,:] = np.fromfile(fileIn, dtype=np.uint8,count=dimensions)/255.0
        fileIn.close()
        
    # And move along
    index += 1

# My work

First, I had to clean the data (MITFaces/faceDS, MITFaces/faceDR), because the parsing data above didn't work for lines such as `1232 (_missing descriptor)`. I got an index out of bound error `testingAgeLabels[index] = parseAgeLabel(parts[4])`. To remedy this, I simply removed the entries from the data file which caused the error. This meant I didn't have to mess around with the already provided, good, code.

I then decided for the first part of the assignment, which was gender classification, to use a random forest. The Sklearn RandomForest implementation is a bagging technique for decision trees. The Random Forest algorithm uses a voting system of a set of trees built by training on different samples of the training population. Decision trees themselves are built using a greedy algorithm which selects optimal split points on a set of data to classify data based on its features.

I used sklearn, and tensorflow for my implementations

In [8]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(trainingFaces, trainingSexLabels)

print("-"*30)
print("Random forest on sex classification\n")
print("Accuracy on training set: {}%".format(rfc.score(trainingFaces, trainingSexLabels)*100))
print("Accuracy on validation set: {}%".format(rfc.score(validationFaces, validationSexLabels)*100))
print("-"*30)

------------------------------
Random forest on sex classification

Accuracy on training set: 99.29894842263394%
Accuracy on validation set: 100.0%
------------------------------


Then, random forest to classify age groups:

In [9]:
rfc = RandomForestClassifier()
rfc.fit(trainingFaces, trainingAgeLabels)

print("-"*30)
print("Random forest on age classification\n")
print("Accuracy on training set: {}%".format(rfc.score(trainingFaces, trainingAgeLabels)*100))
print("Accuracy on validation set: {}%".format(rfc.score(validationFaces, validationAgeLabels)*100))
print("-"*30)

------------------------------
Random forest on age classification

Accuracy on training set: 99.24887330996495%
Accuracy on validation set: 100.0%
------------------------------


And then for expressions:

In [10]:
rfc = RandomForestClassifier()
rfc.fit(trainingFaces, trainingExpLabels)

print("-"*30)
print("Random forest on expression classification\n")
print("Accuracy on training set: {}%".format(rfc.score(trainingFaces, trainingExpLabels)*100))
print("Accuracy on validation set: {}%".format(rfc.score(validationFaces, validationExpLabels)*100))
print("-"*30)

------------------------------
Random forest on expression classification

Accuracy on training set: 98.79819729594391%
Accuracy on validation set: 100.0%
------------------------------


I was very surprised at how well the Random Forest algorithm performed on this dataset. It gets very close to 100% accuracy most times. 



# Convolutional Neural Network Solution


## Network


This CNN is based off the TensorFlow Keras CNN for basic MNIST found at `https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py`. I was not sure about the exact dimensions of the image for resizing for the CNN, so I ended up reshaping to a 1997 x 4 x 64 x 64 image as input. I decided to choose this network because the task of MNist is similar to the tasks given, and it is a network to establish a baseline for better networks.

2D Convolution Layers

In [18]:
# Imports (these take ages)

import tensorflow as tf
from tensorflow import keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout,Conv1D,Conv2D,MaxPooling1D,MaxPooling2D,Flatten


In [39]:

def runCNN(x_train, y_train, x_test, y_test):
    num_classes = len(set(y_train))
    print("Num classes:", num_classes)
    y_train = keras.utils.to_categorical(y_train, num_classes=num_classes)
    y_test = keras.utils.to_categorical(y_test, num_classes=num_classes)

    x_train = x_train.reshape(len(x_train), 64, 64, 4)
    x_test = x_test.reshape(len(x_test), 64, 64, 4)

    input_shape = x_train.shape[1:]

    model = Sequential()
    model.add(Conv2D(32, kernel_size=(2, 3),
                     activation='relu',
                     input_shape=input_shape))
    model.add(Conv2D(64, (2, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))
    
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

    model.fit(x_train, y_train,
              epochs=10, 
              verbose=1,
              batch_size=128)

    score = model.evaluate(x_test, y_test, verbose=0)
    trainingScore = model.evaluate(x_train, y_train, verbose=0)
#     print('Test loss:', score[0], ' (categorical crossentropy)')
    print('Training accuracy: {}%'.format(trainingScore[1]*100))
    print('Test accuracy: {}%'.format(score[1]*100));
    
    print()
    print('Example prediction', x_train[0], y_train, model.predict([x_train[0]]))

## Sex Labels

In [None]:
runCNN(x_train=trainingFaces, y_train=trainingSexLabels, x_test=validationFaces, y_test=validationSexLabels)

Num classes: 2
Epoch 1/10
Epoch 2/10
Epoch 3/10
 256/1997 [==>...........................] - ETA: 26s - loss: 0.6471 - acc: 0.6250

This should get close to 100% accuracy every time on the test data, for this task, which is really good. 35 epochs is a lot for a CNN though, especially since MNist can be solved in under 12 epochs with a similar network and has 60000 images. 
Sometimes this program slows down my machine so much that it can't function (before training), and I suspect that it could be something to do with the memory allocation required when reshaping the faces.
If this happens, please run this code in another environment.

## Age Labels

This just uses the same CNN as the above, but sets the age labels.

In [36]:
runCNN(x_train=trainingFaces, y_train=trainingAgeLabels, x_test=validationFaces, y_test=validationAgeLabels)

Num classes: 4
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training accuracy: 81.47220831396106%
Test accuracy: 0.0%



ValueError: Error when checking input: expected conv2d_22_input to have 4 dimensions, but got array with shape (4, 64, 64)

## Expression Labels

In [29]:
runCNN(x_train=trainingFaces, y_train=trainingExpLabels, x_test=validationFaces, y_test=validationExpLabels)

Num classes: 2
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Training accuracy: 77.56634953025585%
Test accuracy: 100.0%



ValueError: Error when checking input: expected conv2d_12_input to have 4 dimensions, but got array with shape (16384, 1)

In [None]:
# This function will be used to evaluate your submission.

def computeLabels(faceData):
    n, d = faceData.shape
    # Zero arrays for the labels, should be able to do better than this
    estSexLabels = np.zeros(n)
    estAgeLabels = np.zeros(n)
    estExpLabels = np.zeros(n)
    return estSexLabels, estAgeLabels, estExpLabels

estS, estA, estE = computeLabels(testData)
# I'll do stuff with the above to evaluate the accuracy of your methods