# Ch.2: Image Classification  
Based on *Deep Learning for Computer Vision* (Michigan Online): https://www.youtube.com/watch?v=0nqvO3AM2Vw&list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r&index=4

An image classification algorithm takes an image as input and, as output, assigns it to one of a fixed set of categories.

![(2.0.1)image classification example](pictures/2.0.1.jpg)

<center>(2.0.1) image classification example</center>

The main challenge in image recognition and image classification is a problem called the 'semantic gap'.

The semantic gap characterizes the difference between two descriptions of an object by different linguistic representations. Here, we use the term to describe the difference between how a human and an algorithm perceive the same image.

For example, we recognize a picture as what it depicts, without being aware of the underlying process (light enters the eye, reaches the retina, and is processed by the brain). The computer has no such intuition: it sees an image as a giant grid of numbers in the range [0, 255].

Nothing in this grid of numbers obviously indicates a semantically meaningful category. Even worse, the entire grid can change drastically because of the following challenges:

1. Viewpoint variation: a change in viewpoint gives the image a different set of numbers, so the computer may not recognize that the images show the same object.
2. Intraclass variation: objects within a single category can look very different from each other, which makes classification harder.
3. Fine-grained categories: whereas intraclass variation is the difficulty of grouping dissimilar images into one category, fine-grained categories are the difficulty of separating similar images that belong to different categories.
4. Background clutter: the object we want to classify may blend into its background.
5. Illumination changes: changes in brightness alter the pixel values, making it harder for the computer to recognize the object's characteristics.
6. Deformation: the object we want to recognize may appear in different poses (or conditions).
7. Occlusion: the object may not even be fully visible.
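
As a small illustration of the "grid of numbers" view and of illumination changes, the sketch below uses a made-up 2x2 grayscale image. The pixel values and the brightness shift of 40 are assumptions chosen only for this example.

```python
import numpy as np

# A tiny hypothetical 2x2 grayscale "image": just a grid of integers in [0, 255].
image = np.array([[10, 50],
                  [200, 120]], dtype=np.uint8)

# Simulate an illumination change by brightening every pixel by 40
# (clipped so values stay inside [0, 255]).
brighter = np.clip(image.astype(np.int64) + 40, 0, 255).astype(np.uint8)

# Count how many pixel values changed, although the depicted content is the same.
num_changed = int(np.sum(image != brighter))
print(brighter)
print(num_changed)  # 4
```

Every number in the grid changes even though a human would say the picture shows the same thing, which is one concrete face of the semantic gap.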

![(2.0.2) challenges in classification](pictures/2.0.2.jpg)
<center>(2.0.2) challenges in classification</center>

## Image Classification: Building Block for other tasks

Image classification is also a fundamental building block for other algorithms we might want to run in computer vision.

- **Object detection**: draw boxes around the objects in an image, showing where they are located and what they are.

    We can use image classification as a subcomponent of more complex applications like object detection.

    For example, one way to perform object detection is to classify different sub-regions (sliding windows) of the image.

- **Image captioning**: write a natural language sentence that describes what is in the image. This is often framed as a sequence of image classification problems.

    Image classification can be used here as a tool to choose appropriate words for the image.
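
The sliding-window idea for object detection can be sketched as follows. This is only an illustration: `sliding_windows` and `classify_patch` are hypothetical helpers, and the brightness threshold stands in for a real trained classifier.

```python
import numpy as np


def sliding_windows(image, window, stride):
    """Yield (row, col, patch) for every window x window patch of a 2D image."""
    height, width = image.shape
    for row in range(0, height - window + 1, stride):
        for col in range(0, width - window + 1, stride):
            yield row, col, image[row:row + window, col:col + window]


def classify_patch(patch):
    """Hypothetical stand-in classifier: calls a patch 'object' if it is bright."""
    return "object" if patch.mean() > 128 else "background"


# Hypothetical 6x6 image that is dark except for a bright 2x2 square.
image = np.zeros((6, 6), dtype=np.uint8)
image[2:4, 2:4] = 255

# Classify every window and keep the positions labeled as 'object'.
detections = [(row, col)
              for row, col, patch in sliding_windows(image, window=2, stride=2)
              if classify_patch(patch) == "object"]
print(detections)  # [(2, 2)]
```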



![(2.0.3) building block ](pictures/2.0.3.jpg)

<center>(2.0.3) image classification as a building block</center>

## An image classifier

We need an approach that is more robust and more scalable, one that doesn't require us to write down all of human knowledge about what different types of objects look like. This is where machine learning comes in.

## Machine Learning: Data-Driven Approach

Instead of encoding all human knowledge by hand, we take a data-driven approach and use algorithms that can learn from data.

### The basic machine learning pipeline
1. Collect a dataset of images and labels
    Collect images and label them with the types of labels we want our algorithm to predict.
2. Use machine learning to train a classifier
    Deploy some kind of machine learning algorithm that tries to learn the statistical dependencies between the input images in the dataset and the output labels we wrote down during data collection.
3. Evaluate the classifier on new images

In practice, this gives us a two-piece API:

1. train function:
    takes a collection of images and their associated labels as input, and returns some statistical model
2. predict function:
    takes the model produced during the training phase and new images as input, and returns predicted labels for those images

With this API, we don't need to rewrite our machine learning algorithm when the images or object categories change; we only need to supply a new dataset.
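
The two-piece API could look like the minimal sketch below. The "model" here is an assumption made for illustration: it only remembers the most frequent training label, which is far simpler than any classifier discussed later, but it shows the shape of `train` and `predict`.

```python
from collections import Counter


def train(images, labels):
    """Return a 'model'. Here it is only the most frequent training label."""
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return {"most_common_label": most_common_label}


def predict(model, images):
    """Return one predicted label per input image."""
    return [model["most_common_label"] for _ in images]


# Hypothetical toy data: each "image" is a flat list of pixel values.
train_images = [[0, 0], [255, 255], [250, 240]]
train_labels = ["cat", "dog", "dog"]

model = train(train_images, train_labels)
predictions = predict(model, [[1, 2], [3, 4]])
print(predictions)  # ['dog', 'dog']
```

Swapping in a different dataset changes the model that `train` returns, while the code itself stays the same.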

## Image Classification Datasets

#### MNIST<br/>
10 classes: Digits 0 to 9<br/>
28*28 grayscale images  <br/>
50k training images  <br/>
10k test images  <br/>

    Keep in mind that results on MNIST often do not hold on more complex datasets

#### CIFAR10<br/>
10 classes: airplane, automobile, bird, etc. <br/> 
32*32 RGB images  <br/>
50k training images(5k per class)  <br/>
10k testing images(1k per class)  <br/>

    Although it is small compared to other large-scale datasets, it is reasonably challenging because its categories are fairly difficult to recognize

#### ImageNet<br/> 
1000 classes  <br/>
variable size, but often resized to 256*256 for training<br/>
~1.3M training images(~1.3K per class)  <br/>
50k validation images(50 per class)  <br/>
100k test images(100 per class)  <br/>

    Gold standard for image classification datasets

    Performance metric: Top 5 accuracy  
    Algorithm predicts 5 labels for each image; one of them needs to be right
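
Top-5 accuracy can be computed as in the following sketch. The function name `top5_accuracy` and the score matrix are assumptions for illustration; any model that outputs one score per class could supply `scores`.

```python
import numpy as np


def top5_accuracy(scores, labels):
    """Fraction of rows whose true label is among the 5 highest-scoring classes.

    scores: (N, C) array of per-class scores; labels: (N,) array of class indices.
    """
    # Indices of the 5 largest scores in each row (order within the 5 is irrelevant).
    top5 = np.argsort(scores, axis=1)[:, -5:]
    hits = np.any(top5 == labels[:, None], axis=1)
    return float(np.mean(hits))


# Hypothetical scores for 2 images over 7 classes.
scores = np.array([
    [0.10, 0.05, 0.30, 0.20, 0.15, 0.12, 0.08],  # top 5: classes 2, 3, 4, 5, 0
    [0.40, 0.25, 0.01, 0.02, 0.12, 0.10, 0.10],  # top 5: classes 0, 1, 4, 5, 6
])
labels = np.array([5, 2])  # the first label is in its top 5, the second is not

accuracy = top5_accuracy(scores, labels)
print(accuracy)  # 0.5
```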

#### MIT Places<br/>
365 classes: scene types<br/>
variable size, but often resized to 256*256 for training<br/>
~8M training images<br/>
18.25K validation images(50 per class)<br/>
328.5K test images(900 per class)<br/>

<br/>

CIFAR10 sits between two extremes: the computational affordability of smaller datasets (MNIST) and the complexity of large-scale visual recognition tasks (ImageNet, Places365).

<br/>

### Image Classification Datasets: the other direction

One interesting research direction is how to use bigger datasets to make our algorithms classify more robustly.

But people have also started thinking in the other direction.

#### Omniglot<br/>
1623 categories: characters from 50 different alphabets  
20 images per category<br/> 

    Meant to test few-shot learning: benchmarking the ability of algorithms to learn from relatively little data  
    
    This so-called low-shot classification problem is a large and emerging area of research

## First classifier: Nearest Neighbor

We talked about the two pieces of the API: train and predict.

For nearest neighbor, the train function is trivial: it simply memorizes all the data and labels, with no further processing.

On the predict side, we take the new image we want to label and compare it to every image in the training set using some kind of comparison or similarity function. We keep track of the training image most similar to our test image and return its label.

It is a very straightforward algorithm, and it learns only in the sense that it memorizes the training data.

### Distance Metric to compare images

We need a function that computes the similarity between two input images. A very common choice is a distance metric, which takes a pair of images and returns a number representing how far apart they are (a smaller distance means more similar images). Nearest neighbor classification then picks the training image with the smallest distance.

1. Using L1(Manhattan) distance
$$\mathbf{d_{1}(I_{1},I_{2})} = \sum_{P} \left\lvert{I_{1}^{P}-I_{2}^{P}}\right\rvert$$
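
The L1 distance formula above can be worked through on a small example. The two 4x4 images below are illustrative values chosen for this sketch, and the names `test_image` and `train_image` are assumptions.

```python
import numpy as np

# Illustrative 4x4 grayscale images (values chosen for this example).
test_image = np.array([[56, 32, 10, 18],
                       [90, 23, 128, 133],
                       [24, 26, 178, 200],
                       [2, 0, 255, 220]], dtype=np.uint8)
train_image = np.array([[10, 20, 24, 17],
                        [8, 10, 89, 100],
                        [12, 16, 178, 170],
                        [4, 32, 233, 112]], dtype=np.uint8)

# Cast to a signed type first: subtracting uint8 arrays directly would wrap
# around for negative differences (for example 10 - 24).
diff = np.abs(test_image.astype(np.int64) - train_image.astype(np.int64))
l1_distance = int(diff.sum())
print(l1_distance)  # 456
```

The cast matters in practice because image arrays are commonly stored as unsigned 8-bit integers.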

![(2.0.4) L1 distance ](pictures/2.0.4.jpg)

<center>(2.0.4) Distance Metric-L1 distance</center>

```python
import numpy as np


class NearestNeighbor:
    def __init__(self):
        pass

    # ---------------- Train step ----------------
    def train(self, X, y):
        """X is N x D where each row is an example. y is 1-dimensional of size N."""
        # The nearest neighbor classifier simply remembers all the training data.
        self.Xtr = X
        self.ytr = y

    # ---------------- Predict step ----------------
    # For each test image: find the nearest training image and return its label.
    def predict(self, X):
        """X is M x D where each row is an example we wish to predict a label for."""
        num_test = X.shape[0]
        # Make sure that the output type matches the label type.
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)

        # Loop over all test rows.
        for i in range(num_test):
            # Find the nearest training image to the i'th test image
            # using the L1 distance (sum of absolute value differences).
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
            min_index = np.argmin(distances)
            Ypred[i] = self.ytr[min_index]

        return Ypred
```

1. With N examples, how fast is training?
->O(1): about constant time, since we only store the data
2. With N examples, how fast is testing?
->O(N): linear time, since each test image is compared with every training image

*This is bad: we can afford slow training, but we need fast testing*

Later, we'll talk about neural network methods, which take a relatively long time to train but a relatively short time to test.

The results of nearest neighbor can be quite different from what we want: because this method finds training images whose pixels are similar to those of the test image, the nearest image may not have the label we expect.

If you do need to use this method, keep in mind that there are also many methods for fast or approximate nearest neighbor search
(https://github.com/facebookresearch/faiss)

### Nearest Neighbor Decision Boundaries

A decision boundary is the boundary between two classification regions.

![(2.0.5) Decision boundary ](pictures/2.0.5.jpg)

<center>(2.0.5) Decision boundary</center>

#### K-Nearest Neighbors

To smooth the decision boundaries and make the classifier more robust, instead of copying the label from the single nearest neighbor, we can take a majority vote among the K closest points.
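
A minimal sketch of the majority vote, assuming the L1 distance and a hypothetical helper named `knn_predict`. The toy points are made up so that one outlier of class 1 sits inside the class 0 cluster.

```python
import numpy as np
from collections import Counter


def knn_predict(X_train, y_train, x_query, k):
    """Predict the label of x_query by majority vote among its k nearest
    training points under the L1 distance."""
    distances = np.sum(np.abs(X_train - x_query), axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    votes = Counter(y_train[nearest].tolist())
    # most_common(1) returns the label with the most votes; ties are resolved
    # in favor of the label encountered first among the nearest neighbors.
    return votes.most_common(1)[0][0]


# Hypothetical 2D points: one class-1 outlier sits inside the class-0 cluster.
X_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.4, 0.4],
                    [5.0, 5.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
x_query = np.array([0.5, 0.5])

pred_k1 = knn_predict(X_train, y_train, x_query, k=1)
pred_k3 = knn_predict(X_train, y_train, x_query, k=3)
print(pred_k1, pred_k3)  # 1 0
```

With K = 1 the outlier decides the label, while with K = 3 its two class-0 neighbors outvote it.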

![(2.0.6) K-Nearest Neighbors ](pictures/2.0.6.jpg)

<center>(2.0.6) K-Nearest Neighbors</center>

#### K-Nearest Neighbors: Distance Metric
L1 (Manhattan) distance
$$\mathbf{d_{1}(I_{1},I_{2})} = \sum_{P} \left\lvert{I_{1}^{P}-I_{2}^{P}}\right\rvert$$

vs.

L2 (Euclidean) distance
$$\mathbf{d_{2}(I_{1},I_{2})} =\sqrt{ \sum_{P} {(I_{1}^{P}-I_{2}^{P})}^{2}}$$

![(2.0.7) L1 vs. L2 ](pictures/2.0.7.jpg)
![(2.0.7-1) Decision boundaries ](pictures/2.0.7-1.jpg)
<center>(2.0.7) L1 vs. L2</center>

Different distance metrics give the decision boundaries qualitatively different properties.

With L1 distance, the decision boundaries between categories are made up of axis-aligned chunks: every segment is vertical, horizontal, or at a 45-degree angle.

With L2 distance, the boundary segments can appear at any orientation in the input space.

Choosing a distance metric is one way for a human expert to build their own knowledge into the structure the algorithm should take into account.

With the right choice of distance metric, we can apply K-Nearest Neighbors to any type of data.
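
To see that the choice of metric can change which training point counts as nearest, consider the sketch below. The query and the two training points are hypothetical values picked so that L1 and L2 disagree.

```python
import numpy as np

query = np.array([0.0, 0.0])
# Two hypothetical training points.
points = np.array([[3.0, 3.0],   # point A
                   [5.0, 0.0]])  # point B

l1 = np.sum(np.abs(points - query), axis=1)          # [6.0, 5.0]
l2 = np.sqrt(np.sum((points - query) ** 2, axis=1))  # [about 4.243, 5.0]

nearest_l1 = int(np.argmin(l1))  # 1 (point B)
nearest_l2 = int(np.argmin(l2))  # 0 (point A)
print(nearest_l1, nearest_l2)
```

Point B is nearest under L1, but point A is nearest under L2, so a 1-nearest-neighbor classifier could return different labels for the same query.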

c.f) K-Nearest Neighbors: Web Demo <br/>
http://vision.stanford.edu/teaching/cs231n-demos/knn

### Hyperparameters

Hyperparameters are choices about our learning algorithm that we don't learn from the training data<br/> (e.g., in Nearest Neighbor classification, K and the distance metric are hyperparameters).

Since we cannot set them directly through learning, we need some other mechanism to choose which hyperparameter values will work best on our data.

There are not many great ways to choose hyperparameters in practice. They are very problem-dependent, so the simplest approach is to try out different values and see which works best.

#### Setting Hyperparameters

1. Idea #1: Choose hyperparameters that work best on the data
    Select the hyperparameter values that give our learning algorithm the highest accuracy on the training set

    ->Bad: K = 1 always works perfectly on training data<br/>
     The highest accuracy on the training set does not imply the highest accuracy on the test set<br/>

<br/>

2. Idea #2: Split data into train and test; choose hyperparameters that work best on test data

    ->Better than Idea #1, but still bad: no idea how the algorithm will perform on new data
    This idea is just a different way of learning on the test set: the chosen hyperparameters are polluted with knowledge of that test set.
    
    It makes a fundamental error
    
        c.f.) A common assumption in standard supervised learning is the independent and identically distributed (i.i.d.) assumption: we assume that training and test data are drawn independently from the same distribution. <br/>
        Idea #2 uses the test data during the training process, which breaks this assumption: the test set is no longer independent of the model we selected. The popular solution is to make a validation set, a held-out set used only for hyperparameter setting (this is Idea #3)
<br/>

3. Idea #3: Split data into train, val, and test; choose hyperparameters on val and evaluate on test

    ->Better than Idea #1,#2.

<br/>

4. Idea #4: Cross-Validation: Split data into folds, try each fold as validation and average the results

    ->Most robust way to choose hyperparameters, but because of its computational cost, it is not typically done in most machine learning projects
    
        c.f.) Also called K-Fold Cross-Validation
        Often used when we don't have enough training data to set aside a validation set.  
        Divide the initial training data into K non-overlapping subsets, and run K experiments, each using K-1 subsets for training and the remaining one for validation.
        The training and validation errors are the mean values over the K experiments
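
The ideas above can be sketched for choosing K in K-Nearest Neighbors. Everything here is an assumption made for illustration: the helper names `knn_accuracy` and `cross_validate`, the L1 distance, the toy Gaussian data, and the candidate values of K.

```python
import numpy as np


def knn_accuracy(X_train, y_train, X_val, y_val, k):
    """Accuracy of a k-nearest-neighbor classifier (L1 distance) on X_val."""
    correct = 0
    for x, y in zip(X_val, y_val):
        distances = np.sum(np.abs(X_train - x), axis=1)
        nearest_labels = y_train[np.argsort(distances)[:k]]
        # Majority vote; np.bincount counts occurrences of each integer label.
        prediction = np.argmax(np.bincount(nearest_labels))
        correct += int(prediction == y)
    return correct / len(y_val)


def cross_validate(X, y, k, num_folds=5):
    """Mean validation accuracy of k-NN over num_folds folds."""
    X_folds = np.array_split(X, num_folds)
    y_folds = np.array_split(y, num_folds)
    accuracies = []
    for i in range(num_folds):
        # Fold i is the validation set; the remaining folds form the training set.
        X_train = np.concatenate(X_folds[:i] + X_folds[i + 1:])
        y_train = np.concatenate(y_folds[:i] + y_folds[i + 1:])
        accuracies.append(knn_accuracy(X_train, y_train, X_folds[i], y_folds[i], k))
    return float(np.mean(accuracies))


# Hypothetical toy data: two Gaussian blobs in 2D, shuffled with a fixed seed.
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0.0, 1.0, size=(50, 2)),
                    rng.normal(3.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50, dtype=np.int64), np.ones(50, dtype=np.int64)])
order = rng.permutation(len(y))
X, y = X[order], y[order]

# Pitfall of Idea #1: with K = 1 every training point is its own nearest
# neighbor, so accuracy measured on the training data is perfect.
train_acc_k1 = knn_accuracy(X, y, X, y, k=1)

# Idea #4: try several candidate values of K and keep the best mean accuracy.
results = {k: cross_validate(X, y, k) for k in [1, 3, 5, 7]}
best_k = max(results, key=results.get)
print(train_acc_k1, results, best_k)
```

The test set is deliberately absent from this sketch: it would be used only once, after `best_k` has been fixed.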


![(2.0.8) setting hyperparameters ](pictures/2.0.8.jpg)
<center>(2.0.8) setting hyperparameters</center>

![(2.0.8.1) 5-fold cross-validation example ](pictures/2.0.8.1.jpg)
<center>(2.0.8.1) 5-fold cross-validation example</center>


### K-Nearest Neighbor: Universal Approximation

K-Nearest Neighbors actually makes very few assumptions about the types of functions it can represent.

As the number of training samples goes to infinity, K-Nearest Neighbors can represent any function

(subject to many technical conditions, of course: only continuous functions on a compact domain; assumptions about the spacing of training points; etc.).

ex)continuous valued prediction using a nearest neighbor approach

![(2.0.9) continuous valued prediction ](pictures/2.0.9.jpg)
<center>(2.0.9) continuous valued prediction</center>

In this example, a 1-Nearest Neighbor function is used.
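
A rough sketch of this behavior, assuming the target function is `sin` on the interval from 0 to 2π with evenly spaced training points (both are choices made for this illustration, as are the helper names):

```python
import numpy as np


def one_nn_regress(x_train, y_train, x_query):
    """Predict each query value by copying the target of its nearest training point."""
    # Pairwise distances between every query point and every training point.
    distances = np.abs(x_query[:, None] - x_train[None, :])
    return y_train[np.argmin(distances, axis=1)]


def max_error(num_train):
    """Largest absolute error of 1-NN on a dense grid, for num_train samples of sin."""
    x_train = np.linspace(0.0, 2.0 * np.pi, num_train)
    x_query = np.linspace(0.0, 2.0 * np.pi, 1000)
    prediction = one_nn_regress(x_train, np.sin(x_train), x_query)
    return float(np.max(np.abs(prediction - np.sin(x_query))))


error_small = max_error(5)    # few training points: coarse, step-like fit
error_large = max_error(200)  # many training points: much closer fit
print(error_small, error_large)
```

The worst-case error shrinks as training points are added, which matches the intuition that denser samples let nearest neighbor track a continuous function more closely.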

#### Problem: Curse of Dimensionality

For uniform coverage of the space, the number of training points needed grows exponentially with the dimension of the underlying space.

Number of possible 32 x 32 binary images (the same resolution as CIFAR10 images):
$2^{32\times32} \approx 10^{308}$ <br/>
Number of elementary particles in the visible universe:
$\approx 10^{97}$
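
The arithmetic behind these numbers can be checked directly. The rule of 4 sample points per axis in the second part is an assumption used only to illustrate exponential growth.

```python
import math

# Number of distinct 32x32 binary images: each of the 1024 pixels is 0 or 1.
num_pixels = 32 * 32
num_binary_images = 2 ** num_pixels  # exact, thanks to Python's big integers

# Order of magnitude: log10(2^1024) = 1024 * log10(2).
exponent = num_pixels * math.log10(2)
print(round(exponent, 2))  # 308.25

# Assumed coverage rule for illustration: 4 sample points along each axis,
# so d dimensions need 4**d points.
points_needed = {d: 4 ** d for d in (1, 2, 3, 10)}
print(points_needed)  # {1: 4, 2: 16, 3: 64, 10: 1048576}
```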

![(2.0.10) curse of dimensionality ](pictures/2.0.10.jpg)
<center>(2.0.10) curse of dimensionality</center>

#### K-Nearest Neighbor on raw pixels is seldom used

- very slow at test time

- it's very difficult to get enough data to cover the space of all possible images

- distance metrics on raw pixel values are not very semantically meaningful

But Nearest Neighbor on feature vectors computed by deep convolutional neural networks works well for both image classification and image captioning.

## Summary

In image classification we start with a training set of images and labels, and must predict labels on the test set

Image classification is challenging due to the semantic gap: we need invariance to occlusion, deformation, lighting, intraclass variation, etc

Image classification is a building block for other vision tasks

The K-Nearest Neighbors classifier predicts labels based on the nearest training examples

Distance metric and K are hyperparameters

Choose hyperparameters using the validation set; only run on the test set once, at the very end