# Classification -- Images & Hands-On

## Table of Contents
<ol>
    <li>Processing of complicated data like images</li>
    <li>Thinking about models to use for image classification</li>
    <li>Implementation of common models</li>
    <li>Convolutional neural networks -- an ML greatest hit</li>
</ol>

## 1. Processing of complicated data like images

#### Suppose we begin with colored 32 x 32 pixel images of objects we wish to classify.

![](cifar.png)
<span style="font-size:0.75em;">CIFAR-10 Krizhevsky et al.</span>

### How can we encode the information from one image?
![](corgis.png)
![](doge.png)
<span style="font-size:0.75em;">commonlounge.com; subsubroutine.com</span>

### Let's start with a simpler example

In [1]:
import keras
import tensorflow as tf
print(keras.__version__)
print(tf.__version__)

Using TensorFlow backend.


2.2.4
1.13.1


### As long as each data point is of the same shape, we can unroll these 2- or 3-tensors into long vectors
- How many dimensions in each CIFAR-10 data point? Remember this number.
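
As a minimal sketch of the unrolling step, assuming the images sit in a NumPy array whose first axis indexes samples (the random array below is a stand-in, not the real data):

```python
import numpy as np

# Stand-in batch: 5 greyscale images of 28 x 28 pixels (random values, not real data)
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(5, 28, 28))

# Keep the sample axis, collapse every other axis into one long vector
flat = images.reshape(len(images), -1)
print(flat.shape)  # (5, 784)

# The same call works for a batch of 3-tensors (height x width x channel)
def flatten_batch(batch):
    return batch.reshape(len(batch), -1)
```

The helper name `flatten_batch` is only illustrative; any reshape that preserves the sample axis does the same job.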

### Our classification target will be a vector of 0s except for a 1 at the target class.
### Presently our target is instead encoded as a single integer label between 0 and 9.

In [None]:
y_ord = y # In case we need this later

# Encode each element of y as 10-length "one-hot vector" with binary elements
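
One way to do this with plain NumPy is sketched below. The function name `one_hot` and the example labels are made up for illustration; `keras.utils.to_categorical`, used later in this notebook, produces the same kind of array.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return an (n, num_classes) array with a single 1 per row."""
    labels = np.asarray(labels)
    encoded = np.zeros((len(labels), num_classes), dtype=int)
    # Row i gets a 1 in the column given by labels[i]
    encoded[np.arange(len(labels)), labels] = 1
    return encoded

y_example = np.array([3, 0, 9])  # hypothetical integer labels
print(one_hot(y_example))
```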


### Lastly, let's choose an error metric

In [1]:
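import numpy as np

def score(true, pred):
    # Accuracy for one-hot rows: fraction of rows where the prediction matches the target exactly
    return np.mean(np.all(np.asarray(true) == np.asarray(pred), axis=1))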

## 2. Thinking about models to use for image classification

### k-Nearest neighbors

* 1-Nearest Neighbors (aka nearest neighbors)
    - Use some distance metric to compare each 784-D vector (a flattened 28 x 28 image) to all training examples
    - Order the training examples by distance
    - Assign the label of the closest training example
* k-Nearest Neighbors
    - Classify by committee: take a majority vote among the k closest training examples
* Where do these fail?
* How do these scale with training examples?
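
The procedure above can be sketched from scratch in NumPy. This is a toy illustration with Euclidean distance and made-up 2-D points standing in for flattened images, not an efficient implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=1):
    """Predict integer labels for X_query by majority vote of the k nearest training rows."""
    preds = []
    for x in X_query:
        # Euclidean distance from x to every training example
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        # Majority vote; ties resolve to the smallest label
        votes = np.bincount(y_train[nearest])
        preds.append(votes.argmax())
    return np.array(preds)

# Toy 2-D data standing in for flattened images
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_toy = np.array([0, 0, 1, 1])
print(knn_predict(X_toy, y_toy, np.array([[0.2, 0.1], [4.8, 5.1]]), k=3))
```

Note that every prediction scans the whole training set, which bears on the scaling question above.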

### Logistic regression
* Optimal parameters are obtained by maximizing the log-likelihood of the dataset, aka minimizing the negative log-likelihood $-\mathcal{L}$
$$\mathcal{L}(\theta = \{W,b\},\mathcal{D}) = \sum_{i=1}^{|\mathcal{D}|} \log\left(P(Y = y^{(i)} | x^{(i)},W,b)\right)$$
![](log_reg.png)
* "Nonlinear" -- though the prediction always depends on the pixels only through weighted sums, so the decision boundaries between classes are linear
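
To make the log-likelihood concrete, here is a minimal NumPy sketch that evaluates it, assuming a softmax model for $P(Y \mid x, W, b)$. The array shapes and the demo data are assumptions for illustration:

```python
import numpy as np

def log_likelihood(W, b, X, y):
    """Sum of log P(Y = y_i | x_i, W, b) under a softmax model.

    X: (n, d) inputs, y: (n,) integer labels, W: (d, c) weights, b: (c,) biases.
    """
    logits = X @ W + b
    # Subtract the row max before exponentiating for numerical stability
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return log_probs[np.arange(len(y)), y].sum()

# With all-zero parameters every class is equally likely,
# so each example contributes log(1/c)
X_demo = np.ones((4, 3))
y_demo = np.array([0, 1, 2, 1])
ll = log_likelihood(np.zeros((3, 5)), np.zeros(5), X_demo, y_demo)
print(ll, 4 * np.log(1 / 5))
```

Training would adjust `W` and `b` to increase this quantity, or equivalently to decrease its negative.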

### Random forest classifiers
* "Split" predictions based on individual pixels or collections of pixels
* Truly nonlinear

### Feed-forward neural networks
* Nonlinear
* Permits "communication" between pixels via dense layers

## 3. Implementation of common models
### WAIT what haven't I done yet?
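
One step the cells below rely on but that has not appeared yet is a train/validation split: `X_train`, `X_val`, and `y_val` are used without being defined. A minimal sketch of one way to hold out validation data, assuming NumPy arrays (the helper name, the toy data, and the 80/20 ratio are placeholders):

```python
import numpy as np

def train_val_split(X, y, val_fraction=0.2, seed=0):
    """Shuffle the row indices once, then hold out val_fraction of the rows."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], X[val_idx], y[train_idx], y[val_idx]

X_all = np.arange(20).reshape(10, 2)
y_all = np.arange(10)
X_tr, X_va, y_tr, y_va = train_val_split(X_all, y_all)
print(X_tr.shape, X_va.shape)
```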

### k-Nearest Neighbors

In [None]:
from sklearn.neighbors import KNeighborsClassifier
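
A hedged usage sketch follows. It runs on scikit-learn's bundled 8 x 8 digit images as a stand-in for the notebook's data, and `n_neighbors=3` is an arbitrary choice rather than a validated one:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: 8 x 8 digit images, already flattened to 64-D vectors
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=0
)

knn = KNeighborsClassifier(n_neighbors=3)  # k = 3 is an arbitrary choice
knn.fit(X_tr, y_tr)
knn_preds = knn.predict(X_va)

# Fraction of validation labels predicted correctly
knn_acc = (knn_preds == y_va).mean()
print(knn_acc)
```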


### Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression


In [10]:
# Create equivalent accuracy score for ordinal outputs
def score_ord(true, preds):
    # Fraction of positions where the predicted label equals the true label
    correct = sum(t == p for t, p in zip(true, preds))
    return correct / len(true)

In [None]:
print(score_ord(y_ord_val,preds))

### Random forest classifiers - you try

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()


### Feed-forward neural networks

In [13]:
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(100,50,20))
model.fit(X_train, y_train)
preds = model.predict(X_val)
print(score(y_val, preds))

0.9708333333333333


## 4. Convolutional neural networks

![](conv_net.png)
<span style="font-size:0.75em;">easy-tensorflow.com</span>


* Generally, a <i>filter</i> is a rectangular $k \times l$ matrix of weights.
* A single <i>filter</i> slides across an image; at each position it multiplies the pixels in its range elementwise by its weights, sums the products, and applies a nonlinearity.
* The new <i>feature map</i> generated is typically the same size as or smaller than the input.
* Many <i>feature maps</i> are generated with different <i>filters</i>, each with different weights.
* <i>Pooling layers</i> serve to reduce feature map dimensionality.
* The CNN concludes with generic Dense layers.
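
The filter and pooling steps above can be sketched in NumPy for a single greyscale image and a single filter. This is an illustration only: there is no padding, stride, or bias, the nonlinearity is assumed to be ReLU, and the edge filter is hand-picked rather than learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide kernel over image (no padding, stride 1) and apply ReLU."""
    k, l = kernel.shape
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - l + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + k, j:j + l]
            out[i, j] = (patch * kernel).sum()  # elementwise multiply, then add
    return np.maximum(out, 0)  # ReLU nonlinearity

def max_pool_2x2(fmap):
    """Keep the largest value in each non-overlapping 2 x 2 block."""
    h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
    return fmap[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3:] = 1.0                              # dark left half, bright right half
edge_filter = np.array([[-1.0, 0.0, 1.0]] * 3)  # responds to left-to-right brightening
fmap = conv2d_valid(image, edge_filter)
pooled = max_pool_2x2(fmap)
print(fmap.shape, pooled.shape)  # (4, 4) (2, 2)
```

The feature map is nonzero only in the columns where the filter straddles the dark-to-bright edge, which is the sense in which a filter detects a local feature.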

### Examines local areas of photographs -- takes full photo matrix as input, not flattened

In [14]:
from keras.datasets import mnist

# Load MNIST and one-hot encode the integer labels
(X, y), (X_test, y_test) = mnist.load_data()
y_train = keras.utils.to_categorical(y)
y_test = keras.utils.to_categorical(y_test)

### Suppose I have validated the following hyperparameters such that I believe they are optimal. <i>Now</i> we can test.

In [15]:
# Choose train and test, then add a trailing axis of size 1 (the
# greyscale channel) because Conv2D layers expect (height, width, channels)
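
A minimal sketch of the reshaping step, run on a random MNIST-shaped stand-in array rather than the real data. The division by 255 is a common extra step and an assumption here, not something the notebook shows:

```python
import numpy as np

# Stand-in for MNIST-shaped data: 4 greyscale images of 28 x 28 pixels
X_demo = np.random.default_rng(0).integers(0, 256, size=(4, 28, 28)).astype("float32")

# Append a channel axis of size 1 so each image is (28, 28, 1)
X_demo = X_demo[..., np.newaxis]

# Scale pixel values to [0, 1] (assumed preprocessing, not from the original)
X_demo = X_demo / 255.0
print(X_demo.shape)  # (4, 28, 28, 1)
```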


In [16]:
from keras.models import Sequential
from keras.layers import Dense, Conv2D, BatchNormalization, Dropout, Flatten

model = Sequential()

model.add(Conv2D(32,kernel_size=3,activation='relu',input_shape=(28,28,1)))
model.add(BatchNormalization())
model.add(Conv2D(32,kernel_size=3,activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(32,kernel_size=5,strides=2,padding='same',activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.4))

model.add(Conv2D(64,kernel_size=3,activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(64,kernel_size=3,activation='relu'))
model.add(BatchNormalization())
model.add(Conv2D(64,kernel_size=5,strides=2,padding='same',activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.4))

model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(BatchNormalization())
model.add(Dropout(0.4))
model.add(Dense(10, activation='softmax'))

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

history = model.fit(X_train, y_train, batch_size=128, epochs=25, verbose=1)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25
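
To see how the feature maps shrink through the model above, the spatial size after each `Conv2D` layer can be worked out by hand. The sketch below uses the standard size rules for `'valid'` padding (the Keras default) and `'same'` padding; the helper name is illustrative:

```python
import math

def conv_output_size(n, kernel_size, stride=1, padding="valid"):
    """Spatial output size of a convolution along one axis."""
    if padding == "same":
        return math.ceil(n / stride)
    return (n - kernel_size) // stride + 1

# (kernel_size, stride, padding) for the six Conv2D layers in the model above
layers = [(3, 1, "valid"), (3, 1, "valid"), (5, 2, "same"),
          (3, 1, "valid"), (3, 1, "valid"), (5, 2, "same")]

size = 28
sizes = []
for kernel_size, stride, padding in layers:
    size = conv_output_size(size, kernel_size, stride, padding)
    sizes.append(size)

print(sizes)             # [26, 24, 12, 10, 8, 4]
print(size * size * 64)  # 1024 values enter the first Dense layer
```

Here the two stride-2 convolutions do the downsampling that pooling layers would otherwise provide.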


![](KaggleMNIST.png)
<span style="font-size:0.75em;">Kaggle - Chris Deotte 2018</span>

In [22]:
import numpy as np

# Convert each row of softmax probabilities to a one-hot prediction
preds = model.predict(X_test)
final = np.zeros_like(preds)
final[np.arange(len(preds)), preds.argmax(1)] = 1

print(score(y_test, final))

0.9953


In [23]:
model.save("nice_cnn.h5")

### A word on inductive bias and domain knowledge
* CNNs take advantage of our understanding of local features in mapping images to semantic meaning (number labels, dog/cat/plane)
* "A universal function approximator": A sufficiently large dense NN can approximate any continuous function on a bounded domain to arbitrary accuracy, given enough data.
    - "Not really": We rarely have "enough data" and can't train arbitrarily large NNs
    - The name of the game is making the network size and data requirements <i>practical</i>
* The state of the art usually comes from understanding your problem first
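
A rough parameter count illustrates the point about practical network size. The sketch compares the first hidden layer of the dense network used earlier (784 inputs to 100 units) with the first `Conv2D` layer of the CNN (32 filters of size 3 x 3 on one channel); the helper names are illustrative:

```python
def dense_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output unit."""
    return n_in * n_out + n_out

def conv_params(kernel_size, channels_in, filters):
    """One kernel_size x kernel_size x channels_in kernel plus one bias per filter."""
    return kernel_size * kernel_size * channels_in * filters + filters

print(dense_params(784, 100))  # 78500
print(conv_params(3, 1, 32))   # 320
```

Because a filter's weights are shared across every position in the image, the convolutional layer needs far fewer parameters than a dense layer reading the same pixels.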