In [5]:
# %%html
# <!-- This cell makes the font bigger to make it easy to read. Adjust to taste -->
# <style>
# .cell, .CodeMirror pre{ 
#     font-size: 150%;
#     line-height: 125%;
# }
# </style>

# COSC470 Assignment 2, 2018

## Name: YOUR NAME HERE
## Due Date: Monday September 24th

For assignment 2 you need to implement machine learning algorithm(s) to label faces according to:
- sex (male/female)
- age (child/teen/adult/senior)
- expression (smiling/serious)

A data set from MIT is made available, along with code to read the images and labels into `numpy` arrays. 
These arrays are divided into training, validation, and testing data sets.

You may use any machine learning algorithms you like to classify the faces.
Techniques you may find useful that we've looked at include:
- Decision trees and random forests
- Boosting (and AdaBoost in particular)
- Support Vector Machines (SVMs)
- Face detection (to focus on the key parts of the image)
- EigenFaces (for dimensionality reduction)
- Neural networks in TensorFlow
- CNNs in TensorFlow
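
As one illustration of the dimensionality-reduction idea in the list above, here is a minimal sketch of EigenFaces computed with a singular value decomposition in `numpy`. The function names, the number of components, and the random stand-in data are assumptions made for the example; they are not part of the assignment code.

```python
import numpy as np


def fitEigenFaces(faces, numComponents):
    """Return the mean face and the top principal directions (one per row)."""
    meanFace = faces.mean(axis=0)
    centred = faces - meanFace
    # Rows of vt are orthonormal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return meanFace, vt[:numComponents]


def projectEigenFaces(faces, meanFace, components):
    """Project faces onto the principal directions to get compact features."""
    return (faces - meanFace) @ components.T


# Random stand-in data (assumption): 20 "faces" with 64 pixels each
rng = np.random.default_rng(0)
demoFaces = rng.random((20, 64))
meanFace, components = fitEigenFaces(demoFaces, numComponents=5)
demoFeatures = projectEigenFaces(demoFaces, meanFace, components)
print(demoFeatures.shape)
```

The mean face and components would normally be fitted on the training faces only and then reused unchanged to project the validation and testing faces.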

## Submission Requirements

You should submit a version of this Notebook renamed to `YourName.ipynb`, so my submission would be `StevenMills.ipynb`. 
You can assume that the same libraries that are available in the COSC470 Anaconda environment on the lab machines are available.
In particular, you can use numpy, scipy, OpenCV, and TensorFlow.

I should be able to open your Notebook and run it. The Notebook should contain the code to construct and train your classifier(s) from the training data (using the validation data appropriately) and then to compute the labels of a given face data set through a call to `computeLabels`, which has a stub implementation at the end of this notebook. 

## Marking Scheme

A rough marking scheme is given below. This is intentionally fairly open, so that I can give you marks for doing good stuff without having to predetermine what stuff is good.

- 10 marks for the discussion of choice of algorithms and training strategy
- 10 marks for the explanation and clear implementation
- 5 marks for performance

### Algorithm Choice and Training

I will be looking for a description of the algorithm(s) chosen, why you chose that approach, and how you developed, trained and evaluated your method.
You should think about issues such as how to best make use of the training and validation data and how to select parameters for your chosen method.
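
One common way to use the validation data is to fit on the training set with several candidate parameter values and keep the value that scores best on the validation set. The sketch below does this for the `k` of a simple nearest-neighbour classifier. The classifier, the function names, the candidate values, and the synthetic data are all assumptions chosen to keep the example short.

```python
import numpy as np


def knnPredict(trainX, trainY, queryX, k):
    """Label each query row by majority vote among its k nearest training rows."""
    # Squared Euclidean distances via |q|^2 + |t|^2 - 2 q.t, which avoids
    # building a (queries x training x pixels) array
    d2 = ((queryX ** 2).sum(axis=1)[:, None]
          + (trainX ** 2).sum(axis=1)[None, :]
          - 2.0 * queryX @ trainX.T)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = trainY[nearest].astype(int)
    return np.array([np.bincount(row).argmax() for row in votes])


def selectK(trainX, trainY, valX, valY, candidates):
    """Return the candidate k with the highest validation accuracy."""
    bestK, bestAcc = None, -1.0
    for k in candidates:
        acc = float(np.mean(knnPredict(trainX, trainY, valX, k) == valY))
        if acc > bestAcc:
            bestK, bestAcc = k, acc
    return bestK, bestAcc


# Synthetic two-class data (assumption), standing in for faces and labels
rng = np.random.default_rng(0)
trainX = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(5, 1, (30, 4))])
trainY = np.repeat([0.0, 1.0], 30)
valX = np.vstack([rng.normal(0, 1, (10, 4)), rng.normal(5, 1, (10, 4))])
valY = np.repeat([0.0, 1.0], 10)

bestK, bestAcc = selectK(trainX, trainY, valX, valY, candidates=[1, 3, 5])
print(bestK, bestAcc)
```

The testing set plays no part in this selection, so it remains available for an unbiased final estimate.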

You are not restricted to a single classifier or method. If you find it useful to determine age labels first and then use that to help determine expression, then that is fine. If you want to use an SVM for sex classification, but a boosted classifier for age, that's also fine.
However, you should discuss why you chose to use the methods you have chosen.

### Explanation and Clear Implementation

You should implement your chosen algorithm(s) using the training and validation data sets provided. 
Jupyter notebooks let you interleave discussion and code, so you should clearly describe how your implementation works.
You can include mathematics if needed using \\(\LaTeX\\)-style markup as demonstrated in the lecture notebooks.
I'll be looking for clear implementations that illustrate good practice in training and evaluation.

It is expected that you will make use of libraries such as OpenCV and TensorFlow where appropriate, but your explanation should make your understanding of these tools clear. 
For example, if you choose to use a convolutional network, you should explain your architecture, how it relates to the code, and give some justification for the various parameters that you need to select when making a CNN.
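
When justifying CNN parameters, it can help to track how each layer changes the image size. The sketch below applies the standard output-size formula, floor((n + 2p - k) / s) + 1, to a 128 x 128 input. The layer list and the channel count are purely illustrative assumptions, not a recommended architecture.

```python
def convOutputSize(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical layer stack: (name, kernel, stride, padding)
layers = [
    ("conv1", 5, 1, 0),
    ("pool1", 2, 2, 0),
    ("conv2", 5, 1, 0),
    ("pool2", 2, 2, 0),
]

size = 128  # the faces are 128 x 128 pixels
sizes = []
for name, kernel, stride, padding in layers:
    size = convOutputSize(size, kernel, stride, padding)
    sizes.append(size)
    print(name, size)

finalChannels = 16  # assumed number of feature maps in the last layer
flattened = size * size * finalChannels
print("flattened features:", flattened)
```

A table of sizes like this makes it easier to explain why a given kernel size, stride, or number of layers was chosen.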

### Performance

The last cell of the notebook has a function that takes a face data set and produces labels as a result.
You should modify this so that it uses your machine learning algorithms to generate the labels.
I will then use these labels to compare your results to the ground truth.
I may also shuffle the training, validation, and testing data sets around before running your code.
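
The exact evaluation procedure is not specified here. As a sketch of one way to compare estimated labels with ground truth yourself, the following computes accuracy and a confusion matrix. The function names and the toy label arrays are assumptions for illustration.

```python
import numpy as np


def accuracy(estLabels, trueLabels):
    """Fraction of labels that match the ground truth."""
    return float(np.mean(np.asarray(estLabels) == np.asarray(trueLabels)))


def confusionMatrix(estLabels, trueLabels, numClasses):
    """counts[t, e] = number of items with true label t and estimated label e."""
    counts = np.zeros((numClasses, numClasses), dtype=int)
    rows = np.asarray(trueLabels).astype(int)
    cols = np.asarray(estLabels).astype(int)
    np.add.at(counts, (rows, cols), 1)
    return counts


# Toy age labels (assumption): 0 = child, 1 = teen, 2 = adult, 3 = senior
trueAge = np.array([0.0, 1.0, 2.0, 2.0, 3.0, 2.0])
estAge = np.array([0.0, 2.0, 2.0, 2.0, 3.0, 1.0])

ageAccuracy = accuracy(estAge, trueAge)
ageConfusion = confusionMatrix(estAge, trueAge, numClasses=4)
print(ageAccuracy)
print(ageConfusion)
```

A confusion matrix is often more informative than accuracy alone when some classes are much more common than others.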

# The Data Set

The following code reads the data into training, testing, and validation sets.
It assumes that the `.zip` of labelled face data set from the course website has been unzipped into the same directory as the notebook.
There are 1997 training images, and 998 each of validation and testing images.
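
Each face is stored as one flat row of 128 × 128 values scaled to the range [0, 1]. The sketch below shows how such a row can be turned back into a 2-D image, for example for display or for input to a CNN. It uses random bytes in place of a real face file, and it assumes the pixels are stored row by row, which is not confirmed here.

```python
import numpy as np

side = 128
dimensions = side * side

# Random stand-in for the bytes of one raw face file (assumption)
rng = np.random.default_rng(0)
rawBytes = rng.integers(0, 256, size=dimensions, dtype=np.uint8)

# Same scaling as the loading code: bytes 0..255 become floats in [0, 1]
faceRow = rawBytes / 255.0

# Assumed row-major layout: element [r, c] comes from position r * side + c
faceImage = faceRow.reshape(side, side)
print(faceImage.shape, faceImage.min(), faceImage.max())
```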

In [6]:
import numpy as np


# Read in training data and labels

# Some useful parsing functions

# male/female -> 0/1
def parseSexLabel(string):
    if (string.startswith('male')):
        return 0
    if (string.startswith('female')):
        return 1
    print("ERROR parsing sex from " + string)


# child/teen/adult/senior -> 0/1/2/3
def parseAgeLabel(string):
    if (string.startswith('child')):
        return 0
    if (string.startswith('teen')):
        return 1
    if (string.startswith('adult')):
        return 2
    if (string.startswith('senior')):
        return 3
    print("ERROR parsing age from " + string)


# serious -> 0; smiling (or funny) -> 1
def parseExpLabel(string):
    if (string.startswith('serious')):
        return 0
    if (string.startswith('smiling') or string.startswith('funny')):
        return 1
    print("ERROR parsing expression from " + string)


# Count number of training instances

numTraining = 0

for line in open("MITFaces/faceDR"):
    if line.find('_missing descriptor') < 0:
        numTraining += 1

dimensions = 128 * 128

trainingFaces = np.zeros([numTraining, dimensions])
trainingSexLabels = np.zeros(numTraining)  # Sex - 0 = male; 1 = female
trainingAgeLabels = np.zeros(numTraining)  # Age - 0 = child; 1 = teen; 2 = adult; 3 = senior
trainingExpLabels = np.zeros(numTraining)  # Expression - 0 = serious; 1 = smiling

index = 0
for line in open("MITFaces/faceDR"):
    if line.find('_missing descriptor') >= 0:
        continue
    # Parse the label data
    parts = line.split()
    trainingSexLabels[index] = parseSexLabel(parts[2])
    trainingAgeLabels[index] = parseAgeLabel(parts[4])
    trainingExpLabels[index] = parseExpLabel(parts[8])
    # Read in the face
    fileName = "MITFaces/rawdata/" + parts[0]
    fileIn = open(fileName, 'rb')
    trainingFaces[index, :] = np.fromfile(fileIn, dtype=np.uint8, count=dimensions) / 255.0
    fileIn.close()
    # And move along
    index += 1

# Count number of validation/testing instances

numValidation = 0
numTesting = 0

# Assume they're all Validation
for line in open("MITFaces/faceDS"):
    if line.find('_missing descriptor') < 0:
        numValidation += 1

# And make half of them testing
numTesting = int(numValidation / 2)
numValidation -= numTesting

validationFaces = np.zeros([numValidation, dimensions])
validationSexLabels = np.zeros(numValidation)  # Sex - 0 = male; 1 = female
validationAgeLabels = np.zeros(numValidation)  # Age - 0 = child; 1 = teen; 2 = adult; 3 = senior
validationExpLabels = np.zeros(numValidation)  # Expression - 0 = serious; 1 = smiling

testingFaces = np.zeros([numTesting, dimensions])
testingSexLabels = np.zeros(numTesting)  # Sex - 0 = male; 1 = female
testingAgeLabels = np.zeros(numTesting)  # Age - 0 = child; 1 = teen; 2 = adult; 3 = senior
testingExpLabels = np.zeros(numTesting)  # Expression - 0 = serious; 1 = smiling

index = 0
for line in open("MITFaces/faceDS"):
    if line.find('_missing descriptor') >= 0:
        continue

    # Parse the label data
    parts = line.split()
    if (index < numTesting):
        testingSexLabels[index] = parseSexLabel(parts[2])
        testingAgeLabels[index] = parseAgeLabel(parts[4])
        testingExpLabels[index] = parseExpLabel(parts[8])
        # Read in the face
        fileName = "MITFaces/rawdata/" + parts[0]
        fileIn = open(fileName, 'rb')
        testingFaces[index, :] = np.fromfile(fileIn, dtype=np.uint8, count=dimensions) / 255.0
        fileIn.close()
    else:
        vIndex = index - numTesting
        validationSexLabels[vIndex] = parseSexLabel(parts[2])
        validationAgeLabels[vIndex] = parseAgeLabel(parts[4])
        validationExpLabels[vIndex] = parseExpLabel(parts[8])
        # Read in the face
        fileName = "MITFaces/rawdata/" + parts[0]
        fileIn = open(fileName, 'rb')
        validationFaces[vIndex, :] = np.fromfile(fileIn, dtype=np.uint8, count=dimensions) / 255.0
        fileIn.close()

    # And move along
    index += 1

# YOUR WORK GOES HERE...

This cell would be a good place to start adding your own work. 
With Jupyter notebooks you can mix descriptive cells like this one, which use *Markdown* to do simple formatting, with code cells (Python in this instance) like the cells above and below this one.
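
Before building anything elaborate, it can be useful to have a trivial baseline to beat. The sketch below always predicts the most common label seen in training. The helper names and toy label arrays are hypothetical; the returned function simply mirrors the three-array return shape of the `computeLabels` stub.

```python
import numpy as np


def majorityLabel(labels):
    """Most frequent value in an array of non-negative integer-valued labels."""
    return int(np.bincount(np.asarray(labels).astype(int)).argmax())


def makeMajorityLabeller(sexLabels, ageLabels, expLabels):
    """Build a labelling function that always predicts the majority classes."""
    sex = majorityLabel(sexLabels)
    age = majorityLabel(ageLabels)
    exp = majorityLabel(expLabels)

    def labeller(faceData):
        n = faceData.shape[0]
        return (np.full(n, sex, dtype=float),
                np.full(n, age, dtype=float),
                np.full(n, exp, dtype=float))

    return labeller


# Toy training labels and blank stand-in faces (assumptions)
demoSex = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
demoAge = np.array([2.0, 2.0, 1.0, 3.0, 2.0])
demoExp = np.array([0.0, 0.0, 1.0, 0.0, 1.0])
demoFaces = np.zeros((4, 16))

baseline = makeMajorityLabeller(demoSex, demoAge, demoExp)
baseS, baseA, baseE = baseline(demoFaces)
print(baseS, baseA, baseE)
```

Any trained classifier should score above this baseline on the validation data; if it does not, that usually signals a bug or a poorly chosen parameter.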

In [7]:
# This function will be used to evaluate your submission.

def computeLabels(faceData):
    n, d = faceData.shape
    # Zero arrays for the labels, should be able to do better than this
    estSexLabels = np.zeros(n)
    estAgeLabels = np.zeros(n)
    estExpLabels = np.zeros(n)
    return estSexLabels, estAgeLabels, estExpLabels

estS, estA, estE = computeLabels(validationFaces)
# I'll do stuff with the above to evaluate the accuracy of your methods