# TOPICS: CNNs and transfer learning
* From scratch
* Reusing a pretrained model
* Fine-tuning
* Data augmentation
* Feature visualization

### Convolution
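Convolution slides a filter (kernel) across an image and computes one output value at each position. <br>
In Python: <br>
`result = scipy.signal.convolve2d(img, kernel, 'same')`

What a filter does at each position: `np.tensordot(patch, kernel)`, where `patch` is the part of the image currently under the kernel.<br>
For example, with a 3\*3 kernel, np.tensordot(A, B) = $\sum_{i=1}^3\sum_{j=1}^3A_{ij}B_{ij}$.

So a 4\*4 image with a 3\*3 kernel at stride 1 (no padding) gives a 2\*2 result.

That is the 2D case; 3D works the same way. In both cases the element-wise products are summed into a single value.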
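
A minimal sketch of the sliding-window step, assuming a single-channel image and no padding. The helper name `conv2d_valid` is made up for illustration. It does not flip the kernel, which is what CNN layers compute; `scipy.signal.convolve2d` does flip it, so the two agree only for symmetric kernels.

```python
import numpy as np

def conv2d_valid(img, kernel, stride=1):
    """Slide `kernel` over `img` (no padding) and sum the element-wise products."""
    kh, kw = kernel.shape
    out_h = (img.shape[0] - kh) // stride + 1
    out_w = (img.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            top, left = r * stride, c * stride
            patch = img[top:top + kh, left:left + kw]
            # tensordot with its default axes=2 sums patch[i, j] * kernel[i, j]
            out[r, c] = np.tensordot(patch, kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)  # values 0..15
kernel = np.ones((3, 3))
result = conv2d_valid(img, kernel)
print(result.shape)  # (2, 2)
```

With the all-ones kernel each output is just the sum of a 3\*3 patch, so the top-left value is 45.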

### Filters
More filters, more output channels.
* Each filter learns its own set of weights, and each filter connects to every input channel.
* An output channel is called a **feature map**

1\*1 convolution. <br>
The weights have shape 1\*1\*depth. It leaves the width and height unchanged but compresses the depth to 1 (per filter).
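
A rough numpy sketch of one 1\*1 filter, assuming a height\*width\*depth layout. The sizes (5, 5, 8) and the bias value are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(5, 5, 8))  # height, width, depth
weights = rng.normal(size=8)              # one 1*1*depth filter
bias = 0.5

# At every (row, col), take the dot product of the depth vector with the weights.
out = np.tensordot(feature_map, weights, axes=([2], [0])) + bias
print(out.shape)  # (5, 5)
```

Using k such filters would give a 5\*5\*k output, which is how 1\*1 convolutions change depth without touching width or height.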



### Max pooling
Takes the maximum of each n\*n window. There is also average pooling, which for an n\*n window divides the sum of the n^2 cells by n^2.
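
Both kinds of pooling can be sketched with a reshape, assuming non-overlapping windows and an input whose sides are multiples of n. `pool2d` is a made-up helper, not a library function.

```python
import numpy as np

def pool2d(x, n, mode="max"):
    """Non-overlapping n*n pooling of a 2D array whose sides are multiples of n."""
    h, w = x.shape
    # windows[i, a, j, b] == x[i * n + a, j * n + b]
    windows = x.reshape(h // n, n, w // n, n)
    if mode == "max":
        return windows.max(axis=(1, 3))
    return windows.sum(axis=(1, 3)) / (n * n)  # average: divide by n^2

x = np.array([[1, 2, 5, 6],
              [3, 4, 7, 8],
              [9, 10, 13, 14],
              [11, 12, 15, 16]], dtype=float)
max_pooled = pool2d(x, 2, "max")      # [[4, 8], [12, 16]]
avg_pooled = pool2d(x, 2, "average")  # [[2.5, 6.5], [10.5, 14.5]]
```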

### Padding
* Preserves the spatial size after filtering.<br>
* padding='same' pads the input; padding='valid' means no padding.
* padding='same' surrounds the input with a border of zeros.
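
A small sketch of the zero border using `np.pad`, assuming a 3\*3 kernel so that one ring of zeros is enough to keep the size.

```python
import numpy as np

img = np.arange(1, 17, dtype=float).reshape(4, 4)
padded = np.pad(img, pad_width=1, mode="constant", constant_values=0)
print(padded.shape)  # (6, 6)
# A 3*3 kernel fits at 6 - 3 + 1 = 4 positions per axis,
# so the output is 4*4, the same size as the unpadded input.
```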

### Stride
The step size of each slide. A stride larger than 1 can be used to downsize the image, though pooling is now the more usual choice.
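
The usual output-size formula ties kernel size, padding, and stride together. This sketch assumes the same number of zeros (`pad`) is added on each side; `conv_output_size` is a made-up helper name.

```python
def conv_output_size(n, k, stride=1, pad=0):
    """Output length along one axis for input length n and kernel length k."""
    return (n + 2 * pad - k) // stride + 1

print(conv_output_size(4, 3))                    # 2
print(conv_output_size(4, 3, pad=1))             # 4
print(conv_output_size(28, 3, stride=2, pad=1))  # 14
```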

### Common Setup
* One or more stacks of conv/pool layers with relu activation, followed by a flatten then one or two dense layers.
* Feature maps become smaller spatially, and increase in depth.
* Features become more abstract but lose spatial information



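### Why use a CNN rather than a DNN for image recognition
* Efficiency. Dense layers need far more parameters.
* In a dense network, features must be detected separately at all locations; a conv filter shares its weights across positions.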
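
To get a feel for the efficiency gap, here is a rough count under one assumed setup: a 28\*28\*1 input mapped to a 28\*28\*32 output, once by a conv layer with 3\*3 filters and once by a dense layer. The sizes are illustrative.

```python
h, w, in_depth, n_filters = 28, 28, 1, 32

# Conv layer with 3*3 filters: the same weights are reused at every position.
conv = 3 * 3 * in_depth * n_filters + n_filters

# Dense layer producing the same number of outputs: every output sees every input.
n_in = h * w * in_depth
n_out = h * w * n_filters
dense = (n_in + 1) * n_out

print(conv)   # 320
print(dense)  # 19694080
```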

```python
# A typical CNN in keras
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, kernel_size=(3,3),
                activation='relu',
                input_shape=(28,28,1),
                padding='same'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(64, kernel_size=(3,3),
                activation='relu',
                padding='same'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dense(10, activation='softmax'))
```

### \# params by layer type
* Dense: \# params = (\#input + 1) \* (\#output)
* Pooling: 0
* Conv: \# params = (filter size) \* (filter size) \* (input depth) \* (\#filters) + (\#filters) <br>
The second term is the bias, 1 per filter.
* Flatten: 0
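
Applying these formulas by hand to the Keras model above gives the counts below. The helper names are made up; the shapes follow from 'same' padding and two 2\*2 poolings (28 -> 14 -> 7).

```python
def dense_params(n_in, n_out):
    return (n_in + 1) * n_out

def conv_params(filter_size, in_depth, n_filters):
    return filter_size ** 2 * in_depth * n_filters + n_filters

counts = {
    "conv1": conv_params(3, 1, 32),           # 28*28*1  -> 28*28*32
    "conv2": conv_params(3, 32, 64),          # 14*14*32 -> 14*14*64
    "dense1": dense_params(7 * 7 * 64, 128),  # flatten gives 3136 inputs
    "dense2": dense_params(128, 10),
}
total = sum(counts.values())
print(counts)  # {'conv1': 320, 'conv2': 18496, 'dense1': 401536, 'dense2': 1290}
print(total)   # 421642
```

Almost all of the parameters sit in the first dense layer.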

## Some well-known models


## AlexNet
ReLU; Use data augmentation; Dropout

## VGG
Simple, inefficient

## Inception
Efficient, and the architecture is no longer a plain stack of conv/pool layers

#### Global average pooling
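The final dense layers of a CNN were found to hold most of the weights while adding little to performance, so the Inception paper adds a global average pooling layer ahead of the flatten/dense stage, which greatly reduces the number of parameters.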
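
A rough illustration of the saving, assuming a made-up final conv output of 7\*7\*64 feeding a 10-class dense layer.

```python
import numpy as np

rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(7, 7, 64))  # height, width, channels

flat = feature_maps.reshape(-1)             # flatten: 7*7*64 = 3136 values
gap = feature_maps.mean(axis=(0, 1))        # global average pooling: 64 values

n_classes = 10
params_after_flatten = (flat.size + 1) * n_classes
params_after_gap = (gap.size + 1) * n_classes
print(params_after_flatten)  # 31370
print(params_after_gap)      # 650
```

Global average pooling keeps one number per feature map, so the dense layer after it only needs one weight per channel and class.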

#### Basic idea
Instead of choosing which size of filter to use, run a few in parallel and let the network sort out which are useful.
The results are merged by stacking depthwise.
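
The depthwise merge can be pictured with plain arrays. The three branches and their channel counts below are stand-ins chosen for illustration, not the layout from the paper; each branch is assumed to use 'same' padding so height and width match.

```python
import numpy as np

h, w = 28, 28
# Stand-ins for the outputs of three parallel branches on the same input.
branch_1x1 = np.zeros((h, w, 64))
branch_3x3 = np.zeros((h, w, 128))
branch_5x5 = np.zeros((h, w, 32))

# Stack along the channel axis; spatial size is unchanged.
merged = np.concatenate([branch_1x1, branch_3x3, branch_5x5], axis=-1)
print(merged.shape)  # (28, 28, 224)
```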



## Transfer learning
**Idea**: knowledge (weights) learned on one task may be useful on another.
* The base of a CNN learns a feature hierarchy: edges -> shapes -> textures -> ... -> semantic features (eye detectors, ear detectors, etc.)
* Earlier features may generalize to other tasks (especially if trained on a large amount of data, say, ImageNet)

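Visualizing the representations shows that resolution drops as you move up from the lower layers. The representations say more about what is in the image and less about where it is.

A complete CNN is Input -> Trained convolutional base -> Trained classifier -> Prediction.
Here we keep the trained convolutional base (the product of training on a large dataset), add a new, randomly initialized classifier for the task at hand, and use that to produce the prediction.

This makes it possible to train an accurate model with a small amount of data.

The top couple of conv layers can also be fine-tuned.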

### ResNet
<font color='red'>Not fully understood yet</font>

Basic idea:
* 152 layers deep
* Problem with deep networks? <font color='red'>Vanishing gradient</font>
* Mitigated by adding residual (skip) connections, which let the signal bypass layers and propagate through the network.
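
A toy residual block may help here. This is a simplification: it uses two small dense layers for F, whereas the real network uses conv layers. The point is the `+ x`: the output is F(x) + x, so its derivative with respect to x contains an identity term and the gradient has a direct path around F.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def residual_block(x, w1, w2):
    """y = relu(F(x) + x), where F is two small dense layers (a simplification)."""
    f = relu(x @ w1) @ w2
    return relu(f + x)  # the "+ x" is the residual (skip) connection

x = np.array([1.0, 2.0, 3.0])
zero = np.zeros((3, 3))
y = residual_block(x, zero, zero)
print(y)  # [1. 2. 3.]
```

Even with all weights at zero the input passes straight through, so a block that has learned nothing yet does not block the signal.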

### YOLO
Single forward pass object detection: given an image, it draws boxes around the objects and labels each one
* Resize input to 448\*448
* Run a CNN
* Output gives bounding boxes and class labels.

Image is divided into S\*S grid. For each grid cell, predict B bounding boxes and C class probabilities.

**Predictions**: each includes x, y, w, h, confidence. (x, y) gives the center of the bounding box. Confidence is an estimate of the IOU (intersection over union) between the predicted box and the ground truth.
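
A small sketch of IOU for boxes in the (x, y, w, h) center format described above, plus the size of the output implied by the grid. The `iou` helper is made up, and S=7, B=2, C=20 are example values assumed for illustration.

```python
def iou(box_a, box_b):
    """IOU of two boxes given as (x, y, w, h), with (x, y) the box center."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Overlap along each axis, computed from the box edges.
    inter_w = min(ax + aw / 2, bx + bw / 2) - max(ax - aw / 2, bx - bw / 2)
    inter_h = min(ay + ah / 2, by + bh / 2) - max(ay - ah / 2, by - bh / 2)
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

overlap = iou((2, 2, 2, 2), (3, 2, 2, 2))  # intersection 2, union 6

# Each grid cell predicts B boxes of 5 values plus C class probabilities.
S, B, C = 7, 2, 20
output_size = S * S * (B * 5 + C)
print(output_size)  # 1470
```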

It is better to retrain or fine-tune an existing well-known architecture for a task than to design a new model from scratch.

## Data Augmentation
Apply it only to the training set.
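
A minimal numpy sketch, assuming images in an (N, H, W, C) batch and using a random horizontal flip as the only augmentation. `augment_batch` is a made-up helper, and the random arrays stand in for real training images.

```python
import numpy as np

def augment_batch(images, rng):
    """Randomly flip each image in a batch (N, H, W, C) left-to-right."""
    out = images.copy()
    flip = rng.random(len(images)) < 0.5  # pick about half of the images
    out[flip] = out[flip, :, ::-1, :]     # reverse the width axis
    return out

rng = np.random.default_rng(0)
x_train = rng.random((8, 28, 28, 1))
x_train_aug = augment_batch(x_train, rng)  # training data only
# Validation and test images are left untouched.
```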
