<a href="https://colab.research.google.com/github/KeisukeShimokawa/CarND-Advanced-Lane-Lines/blob/master/part1/lesson17_Transfer_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson17 Transfer Learning

GPUのようにスループットを高めることで、並列計算が多いディープラーニングで威力を発揮します。

![](https://i.gyazo.com/bd904b4e0c16de166d5c3c72ea48f920.png)

ニューラルネットワークでは**転移学習**が威力が発揮します。しかし転移学習には以下のようにいくつかの用途が存在します。

![](https://i.gyazo.com/926c3e736a64e250e636d31618b9cfeb.png)

1. 新規データが少ない。元のデータに似ている。
2. 新規データが少ない。元のデータに似ていない
3. 新規データが多い。元のデータに似ている。
4. 新規データが多い。元のデータに似ている。

では以下のネットワークを使用して転移学習させていくことを想定して、それぞれのケースを想定してみましょう。

![](https://i.gyazo.com/65dc3ad06e81842c1b16d1944ee73953.png)



### Case1: Small, Similar Dataset

![](https://i.gyazo.com/22eb3b9025de7d9c91287de29b864543.png)

- 出力層を消す
- 新規データのクラス数に合う線形結合層を追加
- 追加した線形結合層を初期化し、学習済みモデルのパラメータを固定する
- 線形結合層のパラメータを学習する

過学習を避けるためにも学習済みのパラメータを固定したほうがいい。また画像の高次元な特徴も似ているため、事前学習済みのモデルは新規データによく似た特徴を補足できる能力があります。

![](https://i.gyazo.com/dd629e09c6341c457d5f7bf4c6d03918.png)

### Case2: Small, Different Dataset

![](https://i.gyazo.com/0c6df1c739139029fd93126773128173.png)

- ニューラルネットワークの入力近くの層を消去する
- 残ったモデルに新規データのクラスに合う線形結合層を追加する
- 線形結合層を初期化して、学習済みモデルのパラメータを固定する
- 追加した線形結合層を学習する

データセットが小さいため過学習に注意する必要があります。そのため事前学習済みモデルのパラメータは固定しておきます。

異なるデータセットなので、より抽象的な特徴（高次な特徴）は似ている特徴を共有してはいません。そのため低次元な特徴のみを使用したいため、入力に近い層も消去していきます。

![](https://i.gyazo.com/45ddf3392fdf0fb15144655ff11258b8.png)

### Case3: Large, Similar Dataset

![](https://i.gyazo.com/71547926fd73ac4606aed55b60a4c805.png)

- 最終の線形結合層を削除して、新規データのクラスに合う線形結合層に変える
- 追加した線形結合層を初期化する
- 事前学習済みモデルのパラメータを使用して、残りのネットワークのパラメータを初期化する
- 全体を学習しなおす

データセットが大きい場合は過学習を心配する必要があまりないため、ネットワーク全体を学習しなおします。

ただし高次の特徴は共有しているため、全体のネットワークをまだ使用することが可能です。

![](https://i.gyazo.com/f25b469d9331404462bd6b96c3355ba1.png)

### Case4: Large, Different Dataset

![](https://i.gyazo.com/6795f8cad9d3504ba27782f573febc6f.png)

- 最終の線形結合層を削除して、新規データのクラスに合う線形結合層を追加する
- ランダムに初期化したネットワークで学習しなおす
- あるいは`Large & Similar`なデータセットの戦略を採用する

データセットが似ていない場合に関しては、ネットワーク全体を学習しなおしたほうが学習が早いかもしれません。

しかし最近の論文でもあったようにここの結論はいまだ出ていません。

![](https://i.gyazo.com/5d5ee515c1b239bd9c7079a0ddd0f0fa.png)

## LeNet

元論文は[こちら](http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf)

## ImageNet

ImageNetは1000クラス、合計1400万以上の画像を分類するタスクであり、様々なネットワークが採用されました。

kerasでは[keras application](https://keras.io/applications/)を使用すれば、各モデルの事前学習済みのモデルを取得可能です。

## AlexNet

AlexNetではネットワークを2つのGPUに乗せて、より大きなネットワークを構築しています。[元の論文](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)では、ネットワークを並列化させることで1.7%精度を向上させることができています。

![](https://i.gyazo.com/5e0990a1e8bcfe0d98e5247aca0d8103.png)

## VGG

[元の論文](https://arxiv.org/abs/1409.1556)

VGGは3x3畳み込み層と線形結合層を組み合わせた膨大なネットワークです。

![](https://i.gyazo.com/946cc7d01a50d91226ee5ae0f55f2c54.png)


In [0]:
%tensorflow_version 1.x

In [0]:
from keras.applications.vgg16 import VGG16

# weightsはデフォルトが'imagenet'
# weightsがパラメータランダム初期化のときはNone
# include_top=Falseは、imagenet用の1000クラスの出力層を含めるかどうかを指定します。
model = VGG16(weights='imagenet', include_top=False)

In [9]:
!wget https://www.pakutaso.com/shared/img/thumb/tomcat1578_TP_V.jpg -O sample.jpg
!ls

--2020-02-25 14:11:33--  https://www.pakutaso.com/shared/img/thumb/tomcat1578_TP_V.jpg
Resolving www.pakutaso.com (www.pakutaso.com)... 180.235.251.31
Connecting to www.pakutaso.com (www.pakutaso.com)|180.235.251.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 399961 (391K) [image/jpeg]
Saving to: ‘sample.jpg’


2020-02-25 14:11:34 (819 KB/s) - ‘sample.jpg’ saved [399961/399961]

sample_data  sample.jpg


In [18]:
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np

img_path = 'sample.jpg'
img = image.load_img(img_path, target_size=(224, 224))
print(type(img), img.size)

x = image.img_to_array(img)
print(type(x), x.shape)

x = np.expand_dims(x, axis=0)
print(type(x), x.shape)

x = preprocess_input(x)
print(type(x), x.shape)

<class 'PIL.Image.Image'> (224, 224)
<class 'numpy.ndarray'> (224, 224, 3)
<class 'numpy.ndarray'> (1, 224, 224, 3)
<class 'numpy.ndarray'> (1, 224, 224, 3)


In [19]:
from keras.applications.vgg16 import VGG16, decode_predictions

# モデルを読み込みます
model = VGG16(weights='imagenet')

# 前処理済みの画像の予測を行う
predictions = model.predict(x)
print(predictions.shape)

print('predicted: ', decode_predictions(predictions, top=3)[0])

Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5
(1, 1000)
Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/imagenet_class_index.json
predicted:  [('n02123045', 'tabby', 0.32574922), ('n02124075', 'Egyptian_cat', 0.19204272), ('n02123159', 'tiger_cat', 0.098262586)]


## GoogLeNet

[元の論文](https://arxiv.org/abs/1409.4842)

インセプションのモジュールを導入することで、パラメータ数を削減してAlexNetと同等の速度で計算させることもできるうえ、AlexNetよりも精度は向上しています。

![](https://i.gyazo.com/d310e15424390e27a89696735c9660d6.png)

In [0]:
from keras.applications.inception_v3 import InceptionV3

model = InceptionV3(weights='imagenet', include_top=False)

## Caution!! ##
## 実際に使用する際は、画像のサイズを299x299に調整しましょう。

## ResNet

[元の論文](https://arxiv.org/abs/1512.03385)

ResNetでは、残差モジュールを導入することでより深い構造のネットワークを学習させることに成功しました。

In [23]:
from keras.applications.resnet50 import ResNet50

model = ResNet50(weights='imagenet', include_top=False)



## Lab: Transfer Learning

In [0]:
# フラグ設定を行う
freeze_flag = True
weight_flag = 'imagenet'
preprocess_flag = True

In [0]:
from keras.applications.inception_v3 import InceptionV3

# InceptionV3がサポートしている画像サイズは以下になります（kerasが対応しているもの）
# 299x299x3
# 139x139x3
input_size = 139


inception = InceptionV3(weights=weight_flag,
                        include_top=False,
                        input_shape=(input_size, input_size, 3))

In [0]:
if freeze_flag:
    # 層をループ処理で取得してフラグを設定する
    for layer in model.layers:
        layer.trainable = False

In [31]:
inception.summary()

Model: "inception_v3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_10 (InputLayer)           (None, 139, 139, 3)  0                                            
__________________________________________________________________________________________________
conv2d_283 (Conv2D)             (None, 69, 69, 32)   864         input_10[0][0]                   
__________________________________________________________________________________________________
batch_normalization_283 (BatchN (None, 69, 69, 32)   96          conv2d_283[0][0]                 
__________________________________________________________________________________________________
activation_381 (Activation)     (None, 69, 69, 32)   0           batch_normalization_283[0][0]    
_______________________________________________________________________________________

もしも最終層の線形結合層を除去して、新しい層を追加したい場合は`inception.layers.pop()`を使用して、最後の層を削除します。



In [0]:
from keras.layers import Input, Lambda
import tensorflow as tf

# 入力画像を格納する変数を用意しておきます。
cifar_input = Input(shape=(32, 32, 3))

# 入力画像をリサイズします。
resized_input = Lambda(lambda image: tf.image.resize_images(image, 
                                                            (input_size, input_size)))(cifar_input)

# リサイズした画像をモデルに渡たす
inp = inception(resized_input)

In [0]:
from keras.layers import Dense, GlobalAveragePooling2D

# モデルの最終層に新規データのクラス数分の線形結合層を追加します。
x = GlobalAveragePooling2D()(inp)

# Pooling層からそのまま入力を線形結合層に入力する
x = Dense(512, activation='relu')(x)

# 最後に予測を行う
predictions = Dense(10, activation='softmax')(x)

In [36]:
# ではModel APIを使用して定義した入力などの計算をまとめます。
from keras.models import Model

# 入力と出力を定義
model = Model(inputs=cifar_input, outputs=predictions)

# モデルを事前にコンパイル
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])

model.summary()



Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        (None, 32, 32, 3)         0         
_________________________________________________________________
lambda_2 (Lambda)            (None, 139, 139, 3)       0         
_________________________________________________________________
inception_v3 (Model)         (None, 3, 3, 2048)        21802784  
_________________________________________________________________
global_average_pooling2d_1 ( (None, 2048)              0         
_________________________________________________________________
dense_1 (Dense)              (None, 512)               1049088   
_________________________________________________________________
dense_2 (Dense)              (None, 10)                5130      
Total params: 22,857,002
Trainable params: 22,822,570
Non-trainable params: 34,432
________________________________________

### Keras Callback



In [0]:
from keras.callbacks import ModelCheckpoint, EarlyStopping

checkpoint = ModelCheckpoint(filepath='save_path', 
                             monitor='val_loss',
                             save_best_only=True)

stopper = EarlyStopping(monitor='val_loss', min_delta=0.0003. patience=5)

model.fit(callbacks=[checkpoint, stopper])

### GPU Time

In [37]:
from sklearn.utils import shuffle
from sklearn.preprocessing import LabelBinarizer
from keras.datasets import cifar10

(X_train, y_train), (X_val, y_val) = cifar10.load_data()

# One-hot encode the labels
label_binarizer = LabelBinarizer()
y_one_hot_train = label_binarizer.fit_transform(y_train)
y_one_hot_val = label_binarizer.fit_transform(y_val)

# Shuffle the training & test data
X_train, y_one_hot_train = shuffle(X_train, y_one_hot_train)
X_val, y_one_hot_val = shuffle(X_val, y_one_hot_val)

# We are only going to use the first 10,000 images for speed reasons
# And only the first 2,000 images from the test set
X_train = X_train[:10000]
y_one_hot_train = y_one_hot_train[:10000]
X_val = X_val[:2000]
y_one_hot_val = y_one_hot_val[:2000]

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz


[ImageDataGenerator Docs](https://faroit.github.io/keras-docs/2.0.9/preprocessing/image/)

In [0]:
# Use a generator to pre-process our images for ImageNet
from keras.preprocessing.image import ImageDataGenerator
from keras.applications.inception_v3 import preprocess_input

if preprocess_flag == True:
    datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
    val_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
else:
    datagen = ImageDataGenerator()
    val_datagen = ImageDataGenerator()

In [39]:
# Train the model
batch_size = 64
epochs = 5
# Note: we aren't using callbacks here since we only are using 5 epochs to conserve GPU time
model.fit_generator(datagen.flow(X_train, y_one_hot_train, batch_size=batch_size), 
                    steps_per_epoch=len(X_train)/batch_size, epochs=epochs, verbose=1, 
                    validation_data=val_datagen.flow(X_val, y_one_hot_val, batch_size=batch_size),
                    validation_steps=len(X_val)/batch_size)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f6bdc7c1e48>

## Additional Resources on Deep Learning

Nice work reaching the end of the deep learning content! While you still have the project left to do here, we're also providing some additional resources and recent research on the topic that you can come back to if you have time later on.

Reading research papers is a great way to get exposure to the latest and greatest in the field, as well as expand your learning. However, just like the project ahead, it's often best to learn by doing - if you find a paper that really excites you, try to implement it (or even something better) yourself!

#### Optional Reading

All of these are completely optional reading - you could spend hours reading through the entirety of these! We suggest moving onto the project first so you have what you’ve learned fresh on your mind, before coming back to check these out.

We've categorized these papers to hopefully help you narrow down which ones might be of interest, as well as highlighted a couple key reads by category by including their Abstract section, which summarizes the paper.

---

### Behavioral Cloning

The below paper shows one of the techniques Waymo has researched using imitation learning (aka behavioral cloning) to drive a car.

[ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst](https://arxiv.org/abs/1812.03079) by M. Bansal, A. Krizhevsky and A. Ogale

> Abstract: Our goal is to train a policy for autonomous driving via imitation learning that is robust enough to drive a real vehicle. We find that standard behavior cloning is insufficient for handling complex driving scenarios, even when we leverage a perception system for preprocessing the input and a controller for executing the output on the car: 30 million examples are still not enough. We propose exposing the learner to synthesized data in the form of perturbations to the expert's driving, which creates interesting situations such as collisions and/or going off the road. Rather than purely imitating all data, we augment the imitation loss with additional losses that penalize undesirable events and encourage progress -- the perturbations then provide an important signal for these losses and lead to robustness of the learned model. We show that the ChauffeurNet model can handle complex situations in simulation, and present ablation experiments that emphasize the importance of each of our proposed changes and show that the model is responding to the appropriate causal factors. Finally, we demonstrate the model driving a car in the real world.

---

### Object Detection and Tracking

The below papers include various deep learning-based approaches to 2D and 3D object detection and tracking.

[SSD: Single Shot MultiBox Detector](https://arxiv.org/abs/1512.02325) by W. Liu, et. al.

> Abstract: We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. Our SSD model is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stage and encapsulates all computation in a single network. [...] Experimental results [...] confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. Compared to other single stage methods, SSD has much better accuracy, even with a smaller input image size. [...]

[VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection](https://arxiv.org/abs/1711.06396) by Y. Zhou and O. Tuzel

> Abstract: Accurate detection of objects in 3D point clouds is a central problem in many applications, such as autonomous navigation, housekeeping robots, and augmented/virtual reality. To interface a highly sparse LiDAR point cloud with a region proposal network (RPN), most existing efforts have focused on hand-crafted feature representations, for example, a bird's eye view projection. In this work, we remove the need of manual feature engineering for 3D point clouds and propose VoxelNet, a generic 3D detection network that unifies feature extraction and bounding box prediction into a single stage, end-to-end trainable deep network. [...] Experiments on the KITTI car detection benchmark show that VoxelNet outperforms the state-of-the-art LiDAR based 3D detection methods by a large margin. Furthermore, our network learns an effective discriminative representation of objects with various geometries, leading to encouraging results in 3D detection of pedestrians and cyclists, based on only LiDAR.

[Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net](http://openaccess.thecvf.com/content_cvpr_2018/papers/Luo_Fast_and_Furious_CVPR_2018_paper.pdf) by W. Luo, et. al.

> Abstract: In this paper we propose a novel deep neural network that is able to jointly reason about 3D detection, tracking and motion forecasting given data captured by a 3D sensor. By jointly reasoning about these tasks, our holistic approach is more robust to occlusion as well as sparse data at range. Our approach performs 3D convolutions across space and time over a bird’s eye view representation of the 3D world, which is very efficient in terms of both memory and computation. Our experiments on a new very large scale dataset captured in several north american cities, show that we can outperform the state-of-the-art by a large margin. Importantly, by sharing computation we can perform all tasks in as little as 30 ms.

---

### Semantic Segmentation
The below paper concerns a technique called semantic segmentation, where each pixel of an image gets classified individually!

[SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation](https://arxiv.org/abs/1511.00561) by V. Badrinarayanan, A. Kendall and R. Cipolla

> Abstract: We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. [...] The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN and also with the well known DeepLab-LargeFOV, DeconvNet architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. [...] We show that SegNet provides good performance with competitive inference time and more efficient inference memory-wise as compared to other architectures. [...]