# Solution writeup for the self-driving project

This writeup contains my thoughts and answers to the questions in the assignment, and also describes my working process.

## Software setup

I am working on a 2015 MacBook Pro with a 2.7 GHz dual-core Intel Core i5 processor, running macOS Monterey.

After installing the simulator on my Mac, I couldn't run it because of security restrictions in macOS Monterey. The problem was solved by changing the OS security settings and running `chmod 755` on the binary file.

Also, PyTorch version 1.8.2 was not available, so the experiments were run on PyTorch 1.8.1 instead.

## Reproducibility

The experiments were run on the default data given in the assignment. I used `np.random.seed(42)` in `train.py` to remove randomness due to batch splits.
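A minimal sketch of how the seeding could be grouped in one place. The helper name `seed_everything` is hypothetical and is not taken from `train.py`; it only wraps the standard seeding calls of Python, NumPy and PyTorch.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed the Python, NumPy and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)


# Re-seeding reproduces the same random draws.
seed_everything(42)
first = (np.random.rand(3), torch.rand(3))
seed_everything(42)
second = (np.random.rand(3), torch.rand(3))
```

With the same seed, both draws are identical, which is what makes batch splits and weight initialization repeatable between runs.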

# Track 1 performance

When running `model.pt` on Track 1 (the track can be driven in either direction), there are several issues with the driving:
- Jerky steering motions and low speed.
- The car does not follow the road when there is a fork with a dirt road (see the picture).

<img src="images/center_2022_08_20_15_08_22_902.jpg" alt="picture" width="400"/>

- After leaving the road, the car does not return to the track.


## Ways to improve the performance without collecting new data

There are many modifications we can try here.

### Adjusting control

I modified `drive.py` to scale the steering angle (the changed file is in the repository). Setting `steering_angle_alpha = 0.2` and `set_speed = 15` contributed a lot to the smoothness and speed of motion.

### Adjusting the training model

The initial model (`model_default.py` in the repository) has too many parameters for 20k training images. Although it can produce good results after a small number of training epochs, it quickly overfits the data if we let it run for longer. The overfitted model does not generalize well on new data, making it a bad model for self-driving.

We can fight overfitting with a leaner model. Consider two models in the repository: `model_lean.py` and `model_lean_2.py`. Although they are very similar, `model_lean_2` shows signs of overfitting: if we look at train vs val loss, the val loss plateaus after 15 epochs, while train loss is going down:

<img src="model_lean_2.png" alt="picture" width="500"/>

However, `model_lean.py` does not reach the loss of 0.8 achieved by `model_default.py`; its loss only goes down to 0.91:

<img src="model_lean.png" alt="picture" width="500"/>

Also note that `model_lean.py` contains about 1k parameters, a comparable number to 20k datapoints, as opposed to 2M parameters in `model_default.py`.
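Parameter counts like these can be checked with a short helper. The sketch below uses a toy network as a stand-in, since the actual layers of `model_lean.py` are not reproduced here; `count_parameters` and `toy_model` are hypothetical names.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> int:
    """Return the number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


# Toy stand-in network, NOT the architecture from model_lean.py.
toy_model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3),  # 3*4*3*3 weights + 4 biases = 112
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(4, 1),                 # 4 weights + 1 bias = 5
)
n_params = count_parameters(toy_model)  # 112 + 5 = 117
```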

When running `model_lean.pt` on the simulator (with `steering_angle_alpha = 0.05` and `set_speed = 10` in `drive.py`), the car drives reasonably well, with two exceptions: the remaining issue with forks in the road, and driving on the bridge:

<img src="images/center_2022_08_20_15_08_08_344.jpg" alt="picture" width="400"/>

The poor control on the bridge is most likely caused by the low number of filters in `model_lean.py`, which leaves the model too little capacity to learn this specific case from the small share of bridge samples in the training data.

### Data augmentation

Another way to reduce overfitting and to help the model generalize is to generate more data samples from existing data.

I implemented a new data class `UdacityAugmented` in `data.py`, which flips the image from left to right and inverts the desired steering value. Unfortunately, the augmented data did not improve model performance, so I used the standard data class instead.
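The flip itself can be sketched as follows. This is an illustration of the idea, not the code of `UdacityAugmented`; the function name `flip_sample` and the `(C, H, W)` tensor layout are assumptions.

```python
import torch


def flip_sample(image: torch.Tensor, steering: float):
    """Mirror an image left to right and negate the steering target.

    Assumes `image` has shape (C, H, W), so the width is the last dimension.
    """
    return torch.flip(image, dims=[-1]), -steering


# Tiny 1x2x3 "image" so the effect is easy to see.
image = torch.arange(6.0).reshape(1, 2, 3)
flipped, flipped_steering = flip_sample(image, 0.3)
```

In a dataset class, such a flip would typically be applied with some probability per sample, so that the model sees both the original and the mirrored views.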

### Other ideas

There are several more ideas we can try:

1. We can transform our control task into a classification problem with three classes: drive straight, turn left or turn right, which should greatly help with generalization.
<br/><br/>
1. The simulator also exports images from the left and right cameras of the car, thus we could add them to the training data as extra channels.
<br/><br/>
1. We can add regularization to the neural net to decrease overfitting. We can also try different activation functions or adding batchnorm or pooling layers to the deep net.
<br/><br/>
1. We can manually (or automatically, with a pre-trained CNN) label certain training images as 'important' (e.g. images with a fork on the road or images on a bridge). We can then balance our training dataset to include more copies of important images, or we can increase the loss from important images during training.
<br/><br/>
1. We could augment image data with the optical flow (possibly normalized with respect to speed). This way our model would be aware of the dynamic motion of the keypoints and would make better decisions for steering.
<br/><br/>
1. In addition to optical flow, we can perform 3D reconstruction of the scene and train on the 3D scenes.
<br/><br/>
1. It would be more natural to infer the position of the car and its angle with respect to the road from the image that we see, and only then use a simple controller (or RL) to decide the control. To make it possible, we need to build an API in our simulator which provides real-time position of the car with respect to the road.
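The idea of increasing the loss from 'important' images can be sketched as a weighted MSE. The function name `weighted_mse`, the boolean `important` mask and the weight value are all assumptions made for illustration; how the mask is obtained (manually or with a pre-trained CNN) is left open.

```python
import torch


def weighted_mse(pred, target, important, important_weight=5.0):
    """MSE in which samples flagged as important count `important_weight` times.

    pred, target: float tensors of shape (batch,)
    important:    bool tensor of shape (batch,)
    """
    weights = torch.where(
        important,
        torch.full_like(pred, important_weight),
        torch.ones_like(pred),
    )
    # Normalize by the total weight so the loss scale stays comparable.
    return (weights * (pred - target) ** 2).sum() / weights.sum()


pred = torch.tensor([0.0, 0.0])
target = torch.tensor([1.0, 2.0])
important = torch.tensor([False, True])
loss = weighted_mse(pred, target, important, important_weight=3.0)
# (1 * 1 + 3 * 4) / (1 + 3) = 3.25
```

Duplicating important images in the dataset has a similar effect to an integer weight, while the weighted loss avoids enlarging the dataset.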


# Track 2 performance

The model trained on Track 1 does not work for Track 2. There could be several reasons for that:
- Track 1 has distinct road boundaries, while Track 2 does not.
- Track 2 has a different road texture.
- Track 2 has uphill-downhill parts of the road and large shadows, while Track 1 does not.

To summarize, the representations that our model learned on Track 1, such as road boundaries or road texture, do not transfer to Track 2. The problem of learning robust representations that can be reused in another task is called transfer learning. The term 'domain adaptation' is used when the task stays the same but the environment changes.

## Possible solutions
1. A simple solution to this particular case is to train our model on data from both tracks. However, this would not necessarily give us a model which would do well on some unknown Track 3.
<br/><br/>
1. We can apply the technique called domain randomization. Inside the engine, we can apply random textures to the road and the background, put random objects around the road or generate random pieces of the track. Then we collect the data and train the model on images in randomized settings. This way, the model learns much more general representations.
<br/><br/>
1. There are domain adaptation methods based on domain-invariant feature learning. One such method learns a feature embedding for which the feature distributions of images from Track 1 and Track 2 are similar. To implement this, we would add a term to our loss function which measures the divergence between the feature distributions of the two domains.
<br/><br/>
1. Another way to learn domain-invariant features is to set up an adversarial discriminator network which is trained to discriminate between domains (Track 1 and Track 2) based on the feature vector. We then add a loss term to our main network which penalizes correct discrimination, so that the learned features become indistinguishable between domains.
<br/><br/>
1. We can also apply semantic segmentation networks to assign labels like 'road' or 'guard rail' to the pixels, and run our CNN on the label data instead of the RGB data (we probably need to add optical flow there as well). That way, the network learns from semantic information, which makes the learning more transferable.
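As an illustration of the divergence term from the domain-invariant feature learning item above, here is a minimal sketch that uses the squared distance between mean feature vectors (a linear-kernel maximum mean discrepancy). This is only one simple choice of divergence; the function name, the weight `lam` and the placeholder task loss are assumptions.

```python
import torch


def mean_feature_divergence(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Squared L2 distance between the mean feature vectors of two domains.

    feat_a: (n_a, d) features of Track 1 images
    feat_b: (n_b, d) features of Track 2 images
    """
    return (feat_a.mean(dim=0) - feat_b.mean(dim=0)).pow(2).sum()


# Placeholder feature batches standing in for CNN embeddings.
feat_track1 = torch.zeros(4, 2)
feat_track2 = torch.ones(3, 2)
divergence = mean_feature_divergence(feat_track1, feat_track2)  # (-1)^2 + (-1)^2 = 2

# The term would be added to the steering loss with a hypothetical weight `lam`.
task_loss = torch.tensor(0.5)  # placeholder for the steering MSE
lam = 0.1
total_loss = task_loss + lam * divergence
```

Note that only Track 1 needs steering labels here: the divergence term uses the Track 2 images without their targets.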

# Model evaluation

While the steering angle MSE loss gives an indication of how the self-driving car would behave in the common cases, it does not show how robust our model is in corner cases.

Examining the model visually shows the behavior of the car for corner cases, but it takes too much time to validate the driving on the whole track. Ideally, we want the validation to be automatic, without human supervision. Also, the visual observation does not provide any numerical measure of model performance.

Here are some options for measuring the performance of our model:
1. Whenever the car is out of bounds or is stuck, we can restart it a bit further along the track (a simulator API with a restart button would be helpful here). Then a good measure of performance would be the number of restarts needed for one lap of the track.
<br/><br/>
1. To simplify the validation method, we can build an API in our simulator which automatically runs the car on the track and performs restarts. The simulated run could be much faster than real time, and we can launch several validation experiments simultaneously.
<br/><br/>
1. Instead of counting restarts across the whole track, we can set up 10-20 episodic 'challenges' for the car. Each episodic run is about 20 seconds long; in each episode the car starts from a certain position on the track and tries to follow the road from there. We then count how many challenges the car has failed (by either getting off track or being stuck), and we take that as a measure of performance.
<br/><br/>
1. In addition to the discrete measures above, we can introduce continuous measures of performance. One simple measure is the time it takes for the car to complete a lap (or, in the case of episodic runs, the time to complete the episodes), which measures the speed of the car. Another option is the integral of the squared steering angle over the lap (or over the episodes), which measures how stable our driving model is.

Altogether, we can write our validation score in episodic setting as
$$Score = -failed\_episodes - \lambda_1 \sum_{episodes} \int_{episode} steering\_angle^2\ dt - \lambda_2 \sum_{episodes} time\_to\_complete\_episode.$$
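A minimal sketch of how this score could be computed from logged episodes, assuming the steering angle is sampled at a fixed time step `dt` so that the integral becomes a sum. The `Episode` record, the function name and the example numbers are hypothetical. How to count the time of a failed episode is left open in the formula; here the logged duration is used as is.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    failed: bool                  # car got off track or was stuck
    steering_angles: List[float]  # angles sampled every `dt` seconds
    duration: float               # time to complete the episode, in seconds


def validation_score(episodes, dt, lambda_1, lambda_2):
    """Episodic validation score; higher is better."""
    failed = sum(e.failed for e in episodes)
    # Approximate the integral of steering_angle^2 dt with a Riemann sum.
    steering_cost = sum(sum(a * a for a in e.steering_angles) * dt for e in episodes)
    total_time = sum(e.duration for e in episodes)
    return -failed - lambda_1 * steering_cost - lambda_2 * total_time


episodes = [
    Episode(failed=False, steering_angles=[0.1, -0.1], duration=20.0),
    Episode(failed=True, steering_angles=[0.5], duration=5.0),
]
score = validation_score(episodes, dt=0.1, lambda_1=1.0, lambda_2=0.01)
# -1 - 1.0 * 0.027 - 0.01 * 25.0 = -1.277
```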

# Discussion questions

### Is this learning problem supervised or unsupervised? Is it a reinforcement learning problem?

This particular learning problem is a supervised learning problem, without any reinforcement learning.
However, there are ways to set up a reinforcement learning problem for this self-driving setting to make the driving policy more robust to corner cases.

Suppose we have a CNN which learned to output steering wheel angles in the previous supervised learning setting. We will use that network to initialize our RL policy: we adjust the last linear layer of our CNN to output the expected reward for three actions: left by a pre-defined angle alpha, straight, and right by angle alpha. The larger the angle the old CNN predicted, the higher the reward in the corresponding direction, and the lower the reward for the other options.

For the second stage of training, we can use, for example, the Score function from the previous section as a reward function, and train our network to find the expected rewards for the three actions. The resulting network would be able to consider the long-term behavior and policy of the car, and would be able to predict failures in corner cases.

### Why is a PID controller used? Would it be possible to do without one?

The PID controller is used to set a stable predetermined speed for the car, which simplifies the control problem. If we don't optimize for time, the PID controller for speed should be sufficient (except for maybe uphill and downhill sections on Track 2).
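For reference, a generic textbook PID controller for speed can be sketched as below. This is not the controller from `drive.py`; the class name, the gains and the time step are hypothetical values chosen for illustration.

```python
class PIDController:
    """Textbook PID controller that drives a measurement toward a set point."""

    def __init__(self, kp: float, ki: float, kd: float, set_point: float):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.set_point = set_point
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement: float, dt: float = 1.0) -> float:
        error = self.set_point - measurement
        self.integral += error * dt
        # No derivative term on the first call, since there is no previous error.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Hypothetical gains; the target speed plays the role of `set_speed`.
controller = PIDController(kp=0.1, ki=0.002, kd=0.0, set_point=15.0)
throttle = controller.update(measurement=10.0)  # 0.1 * 5 + 0.002 * 5 = 0.51
```

The returned value would be sent to the simulator as the throttle command, while the CNN only has to predict the steering angle.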

If we want to optimize for time, we'd like to learn an optimal speed policy for the car. We can set up the CNN to jointly learn the steering angle and the desired speed from the image. The output of the network would then be two values, and the loss function would be a weighted sum of the two MSEs.
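The weighted sum of two MSEs could look like the sketch below. The column layout of the network output and the value of `speed_weight` are assumptions; in practice the weight would need tuning, because steering angles and speeds live on very different scales.

```python
import torch
import torch.nn.functional as F


def joint_loss(pred: torch.Tensor, target: torch.Tensor, speed_weight: float = 0.5) -> torch.Tensor:
    """Weighted sum of the steering MSE and the speed MSE.

    pred, target: (batch, 2) tensors; column 0 is steering, column 1 is speed.
    """
    steering_loss = F.mse_loss(pred[:, 0], target[:, 0])
    speed_loss = F.mse_loss(pred[:, 1], target[:, 1])
    return steering_loss + speed_weight * speed_loss


pred = torch.tensor([[0.0, 10.0], [0.0, 10.0]])
target = torch.tensor([[1.0, 12.0], [1.0, 12.0]])
loss = joint_loss(pred, target, speed_weight=0.5)  # 1.0 + 0.5 * 4.0 = 3.0
```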

### In general, can you discuss the influence of training with more data on model performance? What about training with potentially less data but selecting the most informative samples?

In general, more training data for a CNN decreases overfitting and thus allows the network to have more parameters. However, if the dataset is not balanced, the network might not learn the under-represented classes.

For the problem of a self-driving car policy, we can see that there are many corner cases which make up only a small fraction of the collected data.
Thus, our network should pay special attention to the corner cases, and picking a subset of the data with the most informative samples would help with that. Another way to reinforce the special cases would be to duplicate the corresponding images in the dataset, or to increase the weight of the corresponding MSE loss terms.

As mentioned before, we might set up a pre-trained CNN to automatically determine the informative and unusual samples. For example, we can import ResNet, look at the feature vectors for the collected images, and pick a subset of vectors which are more-or-less uniformly distributed across the feature space. This would give us a good subset of diverse images for training.
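One simple heuristic for picking vectors that are spread out across the feature space is greedy farthest-point sampling, sketched below. This is an assumption about how the selection could be done, not a tested part of this project; the small `features` array stands in for feature vectors that would come from a pre-trained network.

```python
import numpy as np


def farthest_point_subset(features: np.ndarray, k: int) -> list:
    """Greedily pick k row indices that are far apart in feature space."""
    chosen = [0]  # start from the first sample
    # Distance from every sample to its nearest already chosen sample.
    min_dist = np.linalg.norm(features - features[0], axis=1)
    while len(chosen) < k:
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(features - features[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen


# Stand-in 2D "feature vectors"; rows 0 and 1 are near-duplicates.
features = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [5.0, 0.0]])
subset = farthest_point_subset(features, k=3)
```

In this toy example the near-duplicate of the first sample is skipped, which is the behavior we want when most frames show the same straight road.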

### Would you classify this project as deterministic or non deterministic? Why? What is the impact on the development of a driving model?

This project is nondeterministic due to randomness in batch splits (in the experiments I removed the randomness for reproducibility). Usually there is also randomness due to weight initialization, which was removed by setting `torch.manual_seed(42)` in `train.py` in this implementation.

The non-determinism in this project means that even an over-parametrized CNN could yield a good driving model if the data splits happen to be lucky after 10 epochs. On the other hand, this approach is unstable: it is hard to predict from the model architecture alone whether the model will learn good representations.

### According to you, what are the most relevant or impressive research papers in the literature related to this project? Cite a few and explain briefly why you selected them.

For the most impressive research, I would list a couple of semi-recent papers on domain adaptation:
1. Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping, Bousmalis et al. 2018 (https://arxiv.org/pdf/1709.07857.pdf).
1. CyCADA: Cycle-Consistent Adversarial Domain Adaptation, Hoffman et al. 2018 (http://proceedings.mlr.press/v80/hoffman18a/hoffman18a.pdf).

These papers show quite good results on Sim-to-Real transfer for robot grasping and optical character recognition tasks. They both use a domain discriminator network to encourage the learning of domain-invariant features.

For the most relevant research direction, I would mention an application of vision transformers to domain adaptation:
1. TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation, Yang et al. 2021 (https://arxiv.org/pdf/2108.05988.pdf).

The paper explores the transfer learning capabilities of ViTs, and shows that ViT-based architectures can transfer considerably better than CNN-based networks.