# Table of contents
[5.1. Why object recognition is difficult](#object_reg)  
[5.2. Ways to achieve viewpoint invariance](#viewpoint_invar)  
[5.3. Convolutional neural networks for handwritten digit recognition](#hand_written_digit_reg)  
[5.4. Convolutional neural networks for object recognition](#conv_net_obj_reg)  

## 5.1. Why object recognition is difficult
<a id="object_reg"> </a>
- Things that make it hard to recognize objects:  
    - Segmentation: real scenes are cluttered with other objects:  
        - it is hard to tell which pieces go together as parts of the same object.
        - parts of an object can be hidden behind other objects (occlusion).
    - Lighting: pixel intensities are determined as much by the lighting as by the object, so the same object produces very different pixel values under different lighting.
    - Deformation: objects can deform in a variety of non-affine ways (e.g. a handwritten $2$ can have a large loop or just a cusp), so the same object class can look very different (for example, different ways of writing the numeral $2$ or $4$).
    - Affordances: object classes are often defined by how they are used (e.g. chairs are things designed for sitting on, so they come in a wide variety of physical shapes; modern and classic chairs can look wildly different, and recognizing either as a chair requires knowing that it is meant to be sat on).  
    $\rightarrow$ many objects are defined more by what they are used for than by what they look like.
    - Viewpoint/Transformation: changes in viewpoint cause changes in images that standard learning methods cannot cope with, because information hops between input dimensions (i.e. pixels).
    ![infor_hops](images/infor_hops.png)
    Analogy: imagine a medical database in which a patient's age is sometimes recorded in the field that normally holds the patient's weight, so age and weight randomly swap locations. This is called "dimension hopping", and it needs to be eliminated before applying ML. Viewpoint changes cause the same kind of "dimension hopping" between pixels.
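
To make dimension hopping concrete, here is a minimal sketch (an illustration, not taken from the lecture). It treats a tiny 1-D "image" as an input vector and translates the pattern by two pixels. The helper name `shift_right` and the toy data are assumptions made for this example.

```python
def shift_right(pixels, k):
    """Translate a 1-D image k pixels to the right, padding with zeros."""
    if k == 0:
        return list(pixels)
    return [0] * k + list(pixels[:-k])

# A toy 1-D "image": a bright stroke occupying input dimensions 1 and 2.
image = [0, 9, 9, 0, 0, 0]
shifted = shift_right(image, 2)

# Indices of the input dimensions that carry the stroke before and after.
active_before = [i for i, v in enumerate(image) if v > 0]
active_after = [i for i, v in enumerate(shifted) if v > 0]

print(active_before)  # [1, 2]
print(active_after)   # [3, 4]
```

The stroke is the same object in both cases, but the input dimensions that carry it do not overlap at all. A learner that treats each pixel as a fixed, independent input sees two unrelated vectors.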

## 5.2. Ways to achieve viewpoint invariance
<a id="viewpoint_invar"> </a>
- A few common approaches:
    - Use redundant invariant features
    - Box objects and normalize pixels: put a box around the object and use normalized pixels
    - Replicated features with pooling (convolutional neural nets - $\color{red}{Lecture\ 5.3}$)
    - Hierarchy of parts that have explicit poses relative to camera ($\color{red}{Lecture\ 5.e}$)
- Details:
    - Invariant feature approach: 
        - Extract a large, redundant set of features that are invariant under transformations. 
        - The underlying assumption comes from the observation that humans can effortlessly detect objects in different poses and lighting conditions, so there must exist properties or features that are invariant to these variations. 
        - With enough invariant features, there is only one way to assemble them into an object (relationships between features do not need to be represented directly, because they are captured by other, overlapping features). 
        - But for recognition, we must avoid forming features from parts of different objects.
    - _Judicious normalization approach_ (boxing/normalizing objects):
        - Put a box around the object and use it as a coordinate frame for a set of normalized pixels.
        - Solves "dimension hopping": if the box is always chosen correctly, the same part of an object always falls on the same normalized pixels.
        - Can provide invariance to many degrees of freedom: $\color{red}{translation,\ rotation,\ scale,\ shear,\ stretch\ ...}$
        - Boxing, however, is difficult because of segmentation errors, occlusion, and unusual orientations.
        - We need to recognize the shape in order to box it correctly, which is the very problem we are trying to solve.
    - _Brute force normalization approach_ (boxing):
        - When training the recognizer, use clean data (well-segmented, upright images) so that the correct box can be fitted accurately.
        - At test time, try all possible boxes over a range of positions and scales, since test images are noisier and unsegmented.
        - This is widely used for detecting upright things like faces and house numbers in unsegmented images.
        - It is important that the network can tolerate some sloppiness in the boxing: if the recognizer can cope with some variation in position and scale, we can use a coarse grid when trying all possible boxes, which is much more efficient.
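
The cost argument for the brute-force approach can be sketched as follows. This is a minimal illustration under stated assumptions: the function `candidate_boxes`, the 64x64 image size, the two box sizes, and the stride values are all hypothetical choices, and a real detector would score each box with a trained recognizer.

```python
def candidate_boxes(img_w, img_h, sizes, stride):
    """Yield (x, y, size) for square boxes placed on a grid with the given stride."""
    for size in sizes:
        for y in range(0, img_h - size + 1, stride):
            for x in range(0, img_w - size + 1, stride):
                yield (x, y, size)

# Hypothetical 64x64 test image, scanned at two scales.
fine = list(candidate_boxes(64, 64, sizes=[16, 32], stride=1))
coarse = list(candidate_boxes(64, 64, sizes=[16, 32], stride=4))

# Each box would be passed to the recognizer once.
print(len(fine), len(coarse))  # 3490 250
```

If the recognizer tolerates a few pixels of misalignment, the coarse grid needs roughly 14 times fewer evaluations than the exhaustive one in this toy setting.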

## 5.3. Convolutional neural networks for handwritten digit recognition
<a id="hand_written_digit_reg"> </a>

### 5.3.1. The replicated feature approach (currently the dominant approach for NNs)
- The convolutional neural network is based on the idea of replicated features.  
    - Use many different copies of the __same feature__ detector at different positions.  
        - Detectors could also be replicated across scale and orientation, but that is tricky and expensive.  
        - Replication greatly reduces the number of free parameters that have to be learned.
    - When you use several __different feature types__, each has its own map of replicated detectors (its own convolution function).  
        - Each map contains replicas of the same feature.
        - Within a map, the feature is constrained to be identical in different places.
        - Different maps learn to detect different features.
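
A replicated feature detector can be sketched as one small weight kernel applied at every position of the image. The code below is a minimal pure-Python illustration; the function name `feature_map`, the 2x2 edge kernel, and the toy image are assumptions for this example, not part of the lecture.

```python
def feature_map(image, kernel, bias=0.0):
    """Apply one replicated detector (a shared kernel) at every valid position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            bias + sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

# Hypothetical vertical-edge detector, shared across all positions.
kernel = [[-1, 1],
          [-1, 1]]
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
fmap = feature_map(image, kernel)
print(fmap)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

# Free parameters: one shared kernel plus bias, versus a separate
# detector (kernel plus bias) at every output position.
positions = len(fmap) * len(fmap[0])
shared_params = len(kernel) * len(kernel[0]) + 1
untied_params = positions * shared_params
print(shared_params, untied_params)  # 5 30
```

Using a second kernel with different weights would produce a second feature map, which is what "several different feature types" amounts to in this sketch.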

### 5.3.2. The backpropagation with weight constraints
- Replicated features fit naturally with backpropagation: the algorithm is easily modified to incorporate linear constraints between the weights.
- Compute the gradients as usual, then modify them so that they satisfy the constraints; if the weights satisfy the constraints before the update, they will still satisfy them after it.  
    Ex: To constrain $w_1=w_2$, we need $\Delta w_1=\Delta w_2$ and must start off with $w_1=w_2$.  
    To do this, we compute the gradients of the error w.r.t. $w_1$ and $w_2$: $$\frac{\partial E}{\partial w_1}, \quad \frac{\partial E}{\partial w_2}$$
    Then we use the sum (or the average) of the two gradients, $\frac{\partial E}{\partial w_1} + \frac{\partial E}{\partial w_2}$, to update both $w_1$ and $w_2$.  
    By using weight constraints like this, we can force backpropagation to learn replicated feature detectors.
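
The constrained update can be sketched on a toy problem. The example below assumes a single linear unit with squared error $E = \frac{1}{2}(w_1 x_1 + w_2 x_2 - t)^2$; the inputs, target, and learning rate are hypothetical values chosen only for illustration.

```python
def gradients(w1, w2, x1, x2, target):
    """Gradients of E = 0.5 * (w1*x1 + w2*x2 - target)**2 w.r.t. w1 and w2."""
    err = w1 * x1 + w2 * x2 - target
    return err * x1, err * x2

w1 = w2 = 0.5            # start with the constraint w1 == w2 satisfied
lr = 0.1                 # hypothetical learning rate
x1, x2, target = 1.0, 2.0, 3.0

for _ in range(50):
    g1, g2 = gradients(w1, w2, x1, x2, target)
    tied = g1 + g2       # the same combined gradient is used for both weights
    w1 -= lr * tied
    w2 -= lr * tied

print(w1 == w2)                 # True: the constraint survives every update
print(abs(w1 - 1.0) < 1e-6)     # True: the shared weight converges to 1
```

Because both weights start equal and receive the identical update, they stay equal. The sum $g_1 + g_2$ is also, by the chain rule, the gradient of $E$ with respect to the single shared weight.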

### 5.3.3. What does replicating the feature detectors achieve?
- Equivariant activities: replicated features achieve equivariance, not invariance, to image translation.
    ![translated_image](images/translated_image.png)
    When the image is translated, the black dots are translated too: the representation changes by the same amount as the image. This is equivariance, not invariance.
- Invariant knowledge: what is invariant is the knowledge. If you learn replicated feature detectors and know how to detect a feature in one place, you know how to detect that same feature in any other place.  
    If a feature is useful in some locations during __*training*__, detectors for that feature will be available in all locations during __*testing*__.  
    **Summary**: Replicating the feature detectors achieves equivariance in the activities, but invariance in the weights.
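
Equivariance can be checked directly on a 1-D toy signal. In this minimal sketch the function `detect`, the two-weight kernel, and the signal are assumptions made for illustration; the same weights are applied at every position.

```python
def detect(signal, kernel):
    """Replicated detector: apply the same kernel at every valid position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

kernel = [-1, 1]                       # hypothetical rising-edge detector
signal = [0, 0, 1, 1, 0, 0, 0, 0]
shifted = [0, 0] + signal[:-2]         # same pattern, 2 pixels to the right

acts = detect(signal, kernel)
acts_shifted = detect(shifted, kernel)
print(acts)          # [0, 1, 0, -1, 0, 0, 0]
print(acts_shifted)  # [0, 0, 0, 1, 0, -1, 0]

# Equivariance: the activities move by the same 2 positions as the input.
print(acts_shifted == [0, 0] + acts[:-2])  # True
```

The activities are not the same for the two inputs, so they are not invariant. What stays fixed is the kernel: the knowledge of how to detect the edge applies at every location.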

### 5.3.4. Pooling the outputs of replicated feature detectors
- How do we achieve some invariance in the activities?  
- By pooling the outputs of replicated feature detectors.
- You get a small amount of translational invariance at each level of a deep net by averaging **four** neighboring replicated detectors to give a single output to the next layer.  
    - This reduces the number of inputs to the next layer of feature extraction, which allows us to have many more feature maps and so learn more kinds of features in the next layer.  
    - The pooling can be done by averaging or by taking the maximum of the **four** neighboring replicated detectors.  
- **Problem**: After several levels of pooling, we have lost precise information about _where_ things are. This makes it impossible to use the precise spatial relationships between high-level parts for recognition.  
    Example: In face recognition, to recognize whose face it is you need the precise spatial relationships between components of the face (e.g. between the eyes, or between the nose and the mouth), and pooling discards this information.  
    In other words, such a network can recognize that the image contains a face, but identifying whose face it is requires the precise spatial relationships between high-level parts, which have been lost.
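
Both effects of pooling, the small translational invariance and the loss of position, show up in a toy example. The sketch below assumes non-overlapping 2x2 blocks; the function names `pool2x2` and `mean` and the two feature maps are hypothetical.

```python
def pool2x2(fmap, reduce_fn):
    """Pool non-overlapping 2x2 blocks of a feature map with reduce_fn."""
    return [
        [
            reduce_fn([fmap[r][c], fmap[r][c + 1],
                       fmap[r + 1][c], fmap[r + 1][c + 1]])
            for c in range(0, len(fmap[0]) - 1, 2)
        ]
        for r in range(0, len(fmap) - 1, 2)
    ]

def mean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)

a = [[4, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 8]]
# Same activations, each moved by one pixel inside its 2x2 block.
b = [[0, 4, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 8, 0],
     [0, 0, 0, 0]]

print(pool2x2(a, max))   # [[4, 0], [0, 8]]
print(pool2x2(b, max))   # [[4, 0], [0, 8]]
print(pool2x2(a, mean))  # [[1.0, 0.0], [0.0, 2.0]]
```

The pooled outputs for `a` and `b` are identical, which is the invariance. For the same reason, the next layer can no longer tell where inside each block the feature was, and this loss compounds with every further pooling level.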

## 5.4. Convolutional neural networks for object recognition
<a id="conv_net_obj_reg"> </a>