### Description of the project

In this project, the task was to label the individual pixels of an image and output an entire image instead of a single classification. This was done with a fully convolutional network (FCN) for semantic segmentation. Here, semantic segmentation identifies free space on the road at pixel-level granularity. The ultimate goal of this more detailed approach is to allow better decision making in driverless cars. In this project, pixels were assigned to only two labels: road and no road. With more training data for additional labels such as pedestrians, sidewalks, traffic lights, etc., this approach could be used to analyze pictures in great detail and help a driverless car better understand complex environments. The following picture illustrates what this could look like.

![Example for more labels](./images/semseg_example.png)

Because of the high complexity and the computational power required, the project focuses only on whether a pixel is part of the road. Below is a picture from the *kitti road* training set, followed by the corresponding ground truth image and an overlay that shows how the two fit together.

![training_sample](./images/training_sample.png)
![training_sample_gt](./images/training_sample_gt.png)
![training_sample_overlay](./images/training_sample_overlay.png)

The pictures above show three labels in the ground truth image:
* red - no road
* pink - road the vehicle is driving on (both directions)
* black - other roads

For now, however, we focus on only two labels (road and no road). Therefore, the data was manipulated in the function *get_batches_fn* in *helper.py* before it was read into a numpy array. The red background was labeled as "no road" and everything else (the inverted part, which contains the pink and black areas) was labeled as "road" pixels.
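
The relabeling described above can be sketched in NumPy as follows. This is a minimal illustration and not the code from *helper.py*: the function name `to_two_class_labels`, the RGB channel order, and the exact color values (pure red for the background, `(255, 0, 255)` for pink) are assumptions.

```python
import numpy as np

# Assumption: the "no road" background is pure red in RGB order.
BACKGROUND_COLOR = np.array([255, 0, 0])

def to_two_class_labels(gt_image):
    """Convert an H x W x 3 ground-truth image into an H x W x 2 boolean label array.

    Channel 0 marks "no road" (red background); channel 1 marks "road"
    (everything else, i.e. the pink and black areas).
    """
    no_road = np.all(gt_image == BACKGROUND_COLOR, axis=2)
    road = np.invert(no_road)
    return np.stack((no_road, road), axis=2)

# Tiny 1 x 3 example: one red, one pink, and one black pixel.
example = np.array([[[255, 0, 0], [255, 0, 255], [0, 0, 0]]], dtype=np.uint8)
labels = to_two_class_labels(example)
print(labels[..., 1])  # road mask: only the pink and black pixels are True
```

The two channels are mutually exclusive, so each pixel carries a one-hot label that matches `num_classes = 2`.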


To achieve results faster, a pretrained and frozen VGG16 model was used as the base for the FCN. The idea was to encode a picture, learn from it at the pixel level, and then decode the information into a new picture. The pretrained model followed by a 1x1 convolution serves as the encoder, and transposed convolutions serve as the decoder that upsamples the image data back to the original size.

![General structure of the FCN](./images/FCN_structure.png)

The task was to take the frozen VGG16 and add skip connections, 1x1 convolutions, and transposed convolutions. I therefore took the outputs of VGG layers 3, 4, and 7 and added a 1x1 convolutional layer to each of them. I then upsampled the result for the highest layer (layer 7) with a transposed convolution and added it element-wise to the 1x1 convolution output of the next highest layer (layer 4). I repeated the same step with the output from layer 3 and finally upsampled the sum to the input resolution. The complete structure of the FCN is shown in the picture below, a graph I visualized with TensorBoard. The green box marks the layers added to the pretrained VGG16.

![Graph of overall network structure.](./images/complete_network_graph.png)
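
The element-wise additions only work if the upsampled tensor and the skip tensor have the same spatial size. The sketch below checks that arithmetic for the 160 x 576 input used in this project. It assumes that the outputs of VGG layers 3, 4, and 7 are downsampled by factors of 8, 16, and 32, and that a transposed convolution with `'same'` padding multiplies the spatial size by its stride. The helper names are hypothetical.

```python
import math

def encoder_shapes(height, width):
    """Spatial sizes of the VGG16 outputs that feed the decoder.

    Assumption: layers 3, 4 and 7 are downsampled by 8, 16 and 32.
    """
    return {name: (math.ceil(height / factor), math.ceil(width / factor))
            for name, factor in (("layer3", 8), ("layer4", 16), ("layer7", 32))}

def upsample(shape, stride):
    """Output size of a transposed convolution with 'same' padding."""
    return (shape[0] * stride, shape[1] * stride)

shapes = encoder_shapes(160, 576)
up7 = upsample(shapes["layer7"], 2)   # must match layer4 for the first addition
up4 = upsample(up7, 2)                # must match layer3 for the second addition
output = upsample(up4, 8)             # back to the input resolution

assert up7 == shapes["layer4"]
assert up4 == shapes["layer3"]
print(shapes, output)
```

Under these assumptions the strides 2, 2, and 8 multiply to 32, which undoes the downsampling of the encoder.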

The following sections show the program code of *main.py*, divided into parts with additional explanation. The results are shown and explained after the code to illustrate the improvements made during development.

### Import all necessary dependencies and check TensorFlow and GPU support

Here I imported all necessary dependencies and made sure that the TensorFlow version was correct and the GPU was found.

In [1]:
import os.path
import tensorflow as tf
import helper
import warnings
from distutils.version import LooseVersion
import project_tests as tests

import numpy as np
import random

# Check TensorFlow Version
assert LooseVersion(tf.__version__) >= LooseVersion('1.0'), 'Please use TensorFlow version 1.0 or newer.  You are using {}'.format(tf.__version__)
print('TensorFlow Version: {}'.format(tf.__version__))

# Check for a GPU
if not tf.test.gpu_device_name():
    warnings.warn('No GPU found. Please use a GPU to train your neural network.')
else:
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))

TensorFlow Version: 1.6.0
Default GPU Device: /device:GPU:0


### Load pretrained VGG model into TensorFlow

Next, the function for loading the pretrained VGG16 model was defined.

In [2]:
def load_vgg(sess, vgg_path):
    """
    Load Pretrained VGG Model into TensorFlow.
    :param sess: TensorFlow Session
    :param vgg_path: Path to vgg folder, containing "variables/" and "saved_model.pb"
    :return: Tuple of Tensors from VGG model (image_input, keep_prob, layer3_out, layer4_out, layer7_out)
    """
    # TODO: Implement function
    #   Use tf.saved_model.loader.load to load the model and weights
    vgg_tag = 'vgg16'
    vgg_input_tensor_name = 'image_input:0'
    vgg_keep_prob_tensor_name = 'keep_prob:0'
    vgg_layer3_out_tensor_name = 'layer3_out:0'
    vgg_layer4_out_tensor_name = 'layer4_out:0'
    vgg_layer7_out_tensor_name = 'layer7_out:0'

    tf.saved_model.loader.load(sess, [vgg_tag], vgg_path)
    graph = tf.get_default_graph()
    w1 = graph.get_tensor_by_name(vgg_input_tensor_name)
    keep = graph.get_tensor_by_name(vgg_keep_prob_tensor_name)
    layer_3 = graph.get_tensor_by_name(vgg_layer3_out_tensor_name)
    layer_4 = graph.get_tensor_by_name(vgg_layer4_out_tensor_name)
    layer_7 = graph.get_tensor_by_name(vgg_layer7_out_tensor_name)
    
    return w1, keep, layer_3, layer_4, layer_7

The function was tested separately with the test function already implemented in *project_tests.py*.

In [3]:
tests.test_load_vgg(load_vgg, tf)

Tests Passed


### Create the layers for a fully convolutional network

Next, I implemented the function *layers()*, which contains the 1x1 convolutions and the upsampling of the different layers as described before. I used a kernel regularizer on every layer, as recommended in the course.

In [4]:
def layers(vgg_layer3_out, vgg_layer4_out, vgg_layer7_out, num_classes):
    """
    Create the layers for a fully convolutional network.  Build skip-layers using the vgg layers.
    :param vgg_layer3_out: TF Tensor for VGG Layer 3 output
    :param vgg_layer4_out: TF Tensor for VGG Layer 4 output
    :param vgg_layer7_out: TF Tensor for VGG Layer 7 output
    :param num_classes: Number of classes to classify
    :return: The Tensor for the last layer of output
    """
    # TODO: Implement function
    # 1x1 convolution of vgg layer 7
    layer7a_out = tf.layers.conv2d(vgg_layer7_out, num_classes, 1, 
                                   padding= 'same', 
                                   kernel_initializer= tf.random_normal_initializer(stddev=0.01),
                                   kernel_regularizer= tf.contrib.layers.l2_regularizer(1e-3))
    # upsample
    layer4a_in1 = tf.layers.conv2d_transpose(layer7a_out, num_classes, 4, 
                                             strides= (2, 2), 
                                             padding= 'same', 
                                             kernel_initializer= tf.random_normal_initializer(stddev=0.01), 
                                             kernel_regularizer= tf.contrib.layers.l2_regularizer(1e-3))
    # make sure the shapes are the same!
    # 1x1 convolution of vgg layer 4
    layer4a_in2 = tf.layers.conv2d(vgg_layer4_out, num_classes, 1, 
                                   padding= 'same', 
                                   kernel_initializer= tf.random_normal_initializer(stddev=0.01), 
                                   kernel_regularizer= tf.contrib.layers.l2_regularizer(1e-3))
    # element-wise addition
    layer4a_out = tf.add(layer4a_in1, layer4a_in2)
    # upsample
    layer3a_in1 = tf.layers.conv2d_transpose(layer4a_out, num_classes, 4,  
                                             strides= (2, 2), 
                                             padding= 'same', 
                                             kernel_initializer= tf.random_normal_initializer(stddev=0.01), 
                                             kernel_regularizer= tf.contrib.layers.l2_regularizer(1e-3))
    # 1x1 convolution of vgg layer 3
    layer3a_in2 = tf.layers.conv2d(vgg_layer3_out, num_classes, 1, 
                                   padding= 'same', 
                                   kernel_initializer= tf.random_normal_initializer(stddev=0.01), 
                                   kernel_regularizer= tf.contrib.layers.l2_regularizer(1e-3))
    # element-wise addition
    layer3a_out = tf.add(layer3a_in1, layer3a_in2)
    # upsample
    nn_last_layer = tf.layers.conv2d_transpose(layer3a_out, num_classes, 16,  
                                               strides= (8, 8), 
                                               padding= 'same', 
                                               kernel_initializer= tf.random_normal_initializer(stddev=0.01), 
                                               kernel_regularizer= tf.contrib.layers.l2_regularizer(1e-3))
    return nn_last_layer

The function was tested separately with the test function already implemented in *project_tests.py*.

In [5]:
tests.test_layers(layers)

Tests Passed
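
A note on the kernel regularizer: in TensorFlow 1.x, a `kernel_regularizer` passed to `tf.layers` only registers a penalty term in the `tf.GraphKeys.REGULARIZATION_LOSSES` collection. The penalty influences training only if it is added to the loss that the optimizer minimizes, for example via `tf.losses.get_regularization_loss()`. The NumPy sketch below illustrates the arithmetic only: `tf.contrib.layers.l2_regularizer(scale)` computes `scale * sum(w**2) / 2` per kernel. The function names and the example kernels are hypothetical.

```python
import numpy as np

def l2_penalty(weights, scale=1e-3):
    """L2 penalty for one kernel: scale * sum(w**2) / 2."""
    return scale * 0.5 * np.sum(np.square(weights))

def total_loss(cross_entropy, kernels, scale=1e-3):
    """Cross-entropy plus the summed L2 penalties of all kernels."""
    return cross_entropy + sum(l2_penalty(k, scale) for k in kernels)

# Two made-up kernels: a 1x1 convolution and a 4x4 transposed convolution.
kernels = [np.ones((1, 1, 4, 2)), np.full((4, 4, 2, 2), 0.5)]
print(total_loss(0.7, kernels))
```

Since `optimize()` in the next section minimizes only `cross_entropy_loss`, it may be worth checking whether the penalty is actually part of the training objective.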


### Build the TensorFlow loss and optimizer operations

Then I defined the function *optimize()*, which takes the result from *layers()* and calculates the loss, the accuracy, and the training operation. I chose the cross-entropy loss and an Adam optimizer, as recommended in the lessons. I also added the loss and the accuracy to a summary that I used later to visualize the performance of the FCN in TensorBoard.

In [6]:
def optimize(nn_last_layer, correct_label, learning_rate, num_classes):
    """
    Build the TensorFlow loss and optimizer operations.
    :param nn_last_layer: TF Tensor of the last layer in the neural network
    :param correct_label: TF Placeholder for the correct label image
    :param learning_rate: TF Placeholder for the learning rate
    :param num_classes: Number of classes to classify
    :return: Tuple of (logits, train_op, cross_entropy_loss)
    """
    # TODO: Implement function
    # make logits a 2D tensor where each row represents a pixel and each column a class
    logits = tf.reshape(nn_last_layer, (-1, num_classes))
    correct_label = tf.reshape(correct_label, (-1,num_classes))

    # define loss function
    with tf.name_scope('cross_entropy'):
        cross_entropy_loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits= logits, labels= correct_label))

    # calculate accuracy
    with tf.name_scope('accuracy'):
        correct_prediction = tf.equal(tf.argmax(logits, 1), tf.argmax(correct_label, 1))
        accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

    # define training operation
    with tf.name_scope('train'):
        optimizer = tf.train.AdamOptimizer(learning_rate= learning_rate)
        train_op = optimizer.minimize(cross_entropy_loss)

    tf.summary.scalar("cost", cross_entropy_loss)
    tf.summary.scalar("accuracy", accuracy)
    

    return logits, train_op, cross_entropy_loss

The function was tested separately with the test function already implemented in *project_tests.py*.

In [7]:
tests.test_optimize(optimize)

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

Tests Passed
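
To make the reshape in *optimize()* concrete, the following NumPy sketch flattens a tiny hypothetical network output so that each row is one pixel and each column one class, and then computes the mean softmax cross-entropy and the pixel accuracy. It mirrors the idea of the TensorFlow code above but is not the TensorFlow implementation. All values are made up for illustration.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy over rows (one row per pixel)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(labels * log_probs).sum(axis=1).mean())

def pixel_accuracy(logits, labels):
    """Fraction of pixels whose predicted class matches the label."""
    return float((logits.argmax(axis=1) == labels.argmax(axis=1)).mean())

num_classes = 2
# Hypothetical output for one 2 x 2 image: (batch, height, width, classes).
nn_last_layer = np.array([[[[2.0, 0.0], [0.0, 2.0]],
                           [[1.0, 1.0], [0.0, 3.0]]]])
correct_label = np.array([[[[1, 0], [0, 1]],
                           [[0, 1], [0, 1]]]])

logits = nn_last_layer.reshape(-1, num_classes)   # 4 pixels x 2 classes
labels = correct_label.reshape(-1, num_classes)
print(logits.shape, softmax_cross_entropy(logits, labels), pixel_accuracy(logits, labels))
```

In this example three of the four pixels are classified correctly. The third pixel has equal logits, so `argmax` picks class 0 while the label is class 1.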


In [8]:
# add preprocessing steps here

### Train neural network and print out the loss during training

After that, I implemented the training process, in which the FCN was trained on the KITTI road dataset in batches. All summaries were merged and written to a file for later evaluation with TensorBoard.

In [9]:
def train_nn(sess, epochs, batch_size, get_batches_fn, train_op, cross_entropy_loss, input_image,
             correct_label, keep_prob, learning_rate):
    """
    Train neural network and print out the loss during training.
    :param sess: TF Session
    :param epochs: Number of epochs
    :param batch_size: Batch size
    :param get_batches_fn: Function to get batches of training data.  Call using get_batches_fn(batch_size)
    :param train_op: TF Operation to train the neural network
    :param cross_entropy_loss: TF Tensor for the amount of loss
    :param input_image: TF Placeholder for input images
    :param correct_label: TF Placeholder for label images
    :param keep_prob: TF Placeholder for dropout keep probability
    :param learning_rate: TF Placeholder for learning rate
    """

    # TODO: Implement function
    sess.run(tf.global_variables_initializer())

    summary_op = tf.summary.merge_all()
    writer = tf.summary.FileWriter("./writer/TensorBoard", graph=tf.get_default_graph())
    
    print("Training...")

    step = 0
    for i in range(epochs):
        print("EPOCH {} ...".format(i+1))

        for image, label in get_batches_fn(batch_size):         

            # Run one training step and fetch the merged TensorBoard summaries
            _, summary = sess.run([train_op, summary_op], feed_dict={input_image: image, correct_label: label, keep_prob: 0.5, learning_rate: 0.0009})

            writer.add_summary(summary, step)
            
            #tf.summary.image("input", image, 1)
            #tf.summary.image("label", label, 1)
            
            #print("Batch: " + str(step+1))
            step = step + 1
            

            #print("Loss: = {:.3f}".format(loss))

The test function was commented out because the modifications for TensorBoard would cause it to raise an error.

In [10]:
#tests.test_train_nn(train_nn)

Because the training data was very limited (only 289 pictures), I decided to augment the pictures randomly to increase the robustness of the algorithm. For that purpose I added the function *modify_picture()* to *helper.py*. It is called in *get_batches_fn()* while the neural network is trained in batches, and it randomly flips and rotates each picture together with its ground truth image before the data is passed to the FCN. I limited the rotation to +/- 5 degrees to keep it in a realistic range. Below is an example picture that was flipped horizontally and then rotated.

Original picture:
![Normal picture](./images/training_picture_normal.png)

Horizontally flipped:
![Flipped picture](./images/training_picture_flipped.png)

And rotated:
![Rotated picture](./images/training_picture_rotated.png)
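
The important point of this augmentation is that the picture and its ground truth receive exactly the same random transformation. The sketch below shows one way to do that with NumPy only. It is not the actual *modify_picture()* from *helper.py*, which is not shown here: the names `rotate_nearest` and `augment_pair`, the nearest-neighbour rotation, and the fill colors for the uncovered corners are all assumptions.

```python
import numpy as np

def rotate_nearest(img, angle_deg, fill):
    """Rotate an H x W x C array about its centre (nearest-neighbour, same size)."""
    h, w = img.shape[:2]
    theta = np.deg2rad(angle_deg)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, look up the source pixel.
    src_x = np.rint(np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx).astype(int)
    src_y = np.rint(-np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.empty_like(img)
    out[...] = fill                      # pixels that rotate in from outside the frame
    out[valid] = img[src_y[valid], src_x[valid]]
    return out

def augment_pair(image, gt_image, rng, max_angle=5.0):
    """Apply the same random flip and rotation to an image and its ground truth."""
    if rng.random() < 0.5:
        image, gt_image = np.fliplr(image), np.fliplr(gt_image)
    angle = rng.uniform(-max_angle, max_angle)
    # Assumption: uncovered corners become black in the image and red ("no road") in the labels.
    return rotate_nearest(image, angle, 0), rotate_nearest(gt_image, angle, (255, 0, 0))

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(160, 576, 3), dtype=np.uint8)
gt_image = np.zeros_like(image)
gt_image[..., 0] = 255                   # all-red ground truth, i.e. no road anywhere
aug_image, aug_gt = augment_pair(image, gt_image, rng)
print(aug_image.shape, aug_gt.shape)
```

Nearest-neighbour sampling is used here so that the ground truth keeps its exact label colors. An interpolating rotation would blend red, pink, and black at the borders and create colors that belong to no label.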

### Train neural network

Finally, the function *run()* was implemented to put everything together, from loading the pretrained VGG16 and the dataset to training the modified FCN.

In [None]:
def run():

    # reset graph
    tf.reset_default_graph()

    num_classes = 2 # road and no-road
    image_shape = (160, 576)
    data_dir = './data'
    runs_dir = './runs'
    tests.test_for_kitti_dataset(data_dir)

    # Download pretrained vgg model
    helper.maybe_download_pretrained_vgg(data_dir)

    # OPTIONAL: Train and Inference on the cityscapes dataset instead of the Kitti dataset.
    # You'll need a GPU with at least 10 teraFLOPS to train on.
    #  https://www.cityscapes-dataset.com/

    epochs = 25
    batch_size = 10

    training_image_path = os.path.join(data_dir, 'data_road/training/image_2')
    training_image_no = len(os.listdir(training_image_path))

    # tf placeholders
    learning_rate = tf.placeholder(tf.float32, name='learning_rate')
    correct_label = tf.placeholder(tf.int32, [None, None, None, num_classes], name='correct_label')



    with tf.Session() as sess:
        # Path to vgg model
        vgg_path = os.path.join(data_dir, 'vgg')
        # Create function to get batches
        get_batches_fn = helper.gen_batch_function(os.path.join(data_dir, 'data_road/training'), image_shape)

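        # load vgg
        # NOTE: load_vgg() below loads the model again, which is why the log shows
        # "Restoring parameters" twice; this explicit load appears to be redundant.
        vgg_tag = 'vgg16'
        tf.saved_model.loader.load(sess, [vgg_tag], vgg_path)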

        # OPTIONAL: Augment Images for better results
        #  https://datascience.stackexchange.com/questions/5224/how-to-prepare-augment-images-for-neural-network

        # TODO: Build NN using load_vgg, layers, and optimize function

        input_image, keep_prob, vgg_layer3_out, vgg_layer4_out, vgg_layer7_out = load_vgg(sess, vgg_path)

        nn_last_layer = layers(vgg_layer3_out, vgg_layer4_out, vgg_layer7_out, num_classes)

        logits, train_op, cross_entropy_loss = optimize(nn_last_layer, correct_label, learning_rate, num_classes)

        # TODO: Train NN using the train_nn function
        train_nn(sess, epochs, batch_size, get_batches_fn, train_op, cross_entropy_loss, input_image, correct_label, keep_prob, learning_rate)

        # TODO: Save inference data using helper.save_inference_samples
        helper.save_inference_samples(runs_dir, data_dir, sess, image_shape, logits, keep_prob, input_image)

        # OPTIONAL: Apply the trained model to a video

        print('All finished.')

if __name__ == '__main__':
    run()

Tests Passed
Pretrained vgg model found.
INFO:tensorflow:Restoring parameters from b'./data\\vgg\\variables\\variables'
INFO:tensorflow:Restoring parameters from b'./data\\vgg\\variables\\variables'
Training...
EPOCH 1 ...
EPOCH 2 ...
EPOCH 3 ...
Training Finished. Saving test images to: ./runs\1523841297.566543


### Results

I trained the FCN in steps to see the progress of the learning along the way. I started with only 1 epoch and then increased the number steadily up to 25 epochs. The pictures below show the predicted road pixels overlaid on the original image after different numbers of epochs. Most of the change happened during the first few epochs, so I chose to show epochs 1, 2, 3, 5, and 25.


After 1 Epoch:
![Picture 1 Epoch](./images/1_epoch.png)

After 2 Epochs:
![Picture 2 Epochs](./images/2_epochs.png)

After 3 Epochs:
![Picture 3 Epochs](./images/3_epochs.png)

After 5 Epochs:
![Picture 5 Epochs](./images/5_epochs.png)

After 25 Epochs:
![Picture 25 Epochs](./images/25_epochs.png)

After the first epoch, the prediction was almost a random guess: the red pixels seem to be almost evenly distributed. After the second epoch, most of the pixels were gone and only a few appeared in the center, where the algorithm was certain that they belonged to the road. After the third epoch, a lot more pixels showed up, but they did not reach all the way to the sides of the road or into the distance. Larger steps forming the diagonal limits are clearly visible. The algorithm also falsely recognized the pull-off area on the right as part of the road. After 5 epochs, the steps were smoother and the detection reached further into the distance. After 25 epochs, the result was much smoother, the pull-off area was no longer marked, and some parts of the road on the other side were detected.

To show the progress over the 25 epochs, I recorded the accuracy and cost for each batch that was used for training. I trained for 25 epochs with a batch size of 10. For 289 images, that meant 29 batches per epoch or 725 batches in total. The images of the final test with the detected road pixels overlaid were saved, as was the summary of the accuracy and cost for TensorBoard. The two graphs are shown below:


Accuracy:
![TensorBoard Accuracy](./images/tensorboard_accuracy.png)

Cost:
![TensorBoard Cost](./images/tensorboard_cost.png)
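
The batch count quoted above, which is also the number of steps on the x-axis of the two graphs, follows from a short calculation. The snippet assumes that the last, partial batch of each epoch (9 images) is used for training as well.

```python
import math

num_images = 289
batch_size = 10
epochs = 25

# 28 full batches plus one partial batch with the remaining 9 images
batches_per_epoch = math.ceil(num_images / batch_size)
total_batches = batches_per_epoch * epochs

print(batches_per_epoch, total_batches)  # 29 725
```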

Below are a few examples where the algorithm did a good job detecting the road pixels:


It detected multiple lanes in both directions and was not distracted by the shadows of the trees:
![Good example 1](./images/good_example_1.png)

It omitted the pixels of the cars that were on the street with good accuracy (there are still some red pixels in the car):
![Good example 2](./images/good_example_2.png)

It recognized that the sidewalk to the right does not belong to the road even though it has a very similar color and shape:
![Good example 3](./images/good_example_3.png)

And here are some examples where the algorithm failed to detect a significant number of the correct road pixels:


It recognized the sidewalk as part of the road and did not recognize the parts of the road in the shadows:
![Bad example 1](./images/bad_example_1.png)

Again, the shadows were throwing off the road recognition:
![Bad example 2](./images/bad_example_2.png)

The different light conditions of the underpass caused some problems, too:
![Bad example 3](./images/bad_example_3.png)

### Further Optimization

The biggest problem seemed to be hard light and shadows on the street caused by buildings, trees, cars, etc. This is understandable, as only a few pictures in the training data showed shadows, so the algorithm could not learn to handle them properly. To achieve better performance, the solution could be tweaked further in several ways. Additional steps could be added to the image preprocessing, such as random brightness adjustments of the whole picture or of parts of the picture (to create artificial shadows). The pictures could also be normalized and converted into a different color space. Even more important would be to train on a larger data set, as 289 pictures are not enough to cover all the different traffic situations.
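
The brightness and artificial-shadow ideas could look roughly like the following NumPy sketch. It was not part of the project and has not been evaluated on the dataset. The function names, the brightness range, and the shape of the shadow (everything on one side of a random straight line) are assumptions chosen for illustration.

```python
import numpy as np

def random_brightness(image, rng, low=0.6, high=1.4):
    """Scale the brightness of the whole uint8 image by a random factor."""
    factor = rng.uniform(low, high)
    return np.clip(image.astype(np.float32) * factor, 0, 255).astype(np.uint8)

def random_shadow(image, rng, darkness=0.5):
    """Darken the part of the image on one side of a random straight line."""
    h, w = image.shape[:2]
    x_top, x_bottom = rng.uniform(0, w, size=2)
    ys, xs = np.mgrid[0:h, 0:w]
    # Boundary x-position for each row, interpolated between top and bottom.
    boundary = x_top + (x_bottom - x_top) * ys / max(h - 1, 1)
    mask = xs < boundary
    shadowed = image.astype(np.float32)
    shadowed[mask] *= darkness
    return shadowed.astype(np.uint8)

rng = np.random.default_rng(0)
image = np.full((160, 576, 3), 200, dtype=np.uint8)
augmented = random_shadow(random_brightness(image, rng), rng)
print(augmented.shape, augmented.dtype, augmented.min(), augmented.max())
```

Both functions change only the input picture. The ground truth image stays untouched, because lighting does not change which pixels belong to the road.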