# Hour 3: Introduction to Image Classification:
Python and Classification Labcamp - Milano, 25/05/2016

Alex Loosley (a.loosley@reply.de)
<br>Alex Salles (a.salles@reply.de)

## Image Classification
Now that we've learned how to munge data with pandas (to get *X_train*, *y_train*) and train classifiers with sklearn (*clf.fit(X_train, y_train)*), let's do some image classification!

### Image Feature Engineering:
How do we build an **X** to describe images?  Put another way, how does one engineer features for images?  With the flowers, we had sepal/petal size measurements.  What can we measure in images?
* Pixel Intensities?
* Shapes?
* Percent Dark Space?
* *etc.* ... something else based on domain knowledge?
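
To make the ideas in the list above concrete, here is a minimal sketch that turns a greyscale image into three hand-made features: mean pixel intensity, intensity spread, and fraction of dark pixels. The function name `simple_image_features` and the `dark_threshold` value of 50 are assumptions made for illustration; they do not come from the workshop code.

```python
import numpy as np

def simple_image_features(image, dark_threshold=50):
    """Return [mean intensity, intensity spread, fraction of dark pixels]."""
    pixels = np.asarray(image, dtype=float)
    mean_intensity = pixels.mean()
    intensity_spread = pixels.std()
    # Fraction of pixels darker than the (arbitrary) threshold
    fraction_dark = (pixels < dark_threshold).mean()
    return np.array([mean_intensity, intensity_spread, fraction_dark])

# Toy 2x3 greyscale "image": half black, half white
toy_image = np.array([[0, 0, 255],
                      [0, 255, 255]])
features = simple_image_features(toy_image)
print(features)
```

Features like these are cheap to compute, but they change as soon as the lighting or framing changes, which is one motivation for the more robust descriptors discussed next.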

<img src="https://upload.wikimedia.org/wikipedia/commons/7/78/Petal-sepal.jpg" width="300">

### SIFT Features:
[SIFT](http://docs.opencv.org/3.1.0/da/df5/tutorial_py_sift_intro.html#gsc.tab=0) stands for Scale-Invariant Feature Transform.  It was invented to extract descriptive features from images that do not depend on scale, rotation, absolute brightness (assuming the image isn't saturated), *etc.*
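
As a rough illustration of the brightness part of that claim, the sketch below builds a toy descriptor: a normalized histogram of gradient orientations. This is not SIFT itself, only a simplified stand-in for the same idea; the function name `orientation_histogram` and the random patch are assumptions made for the example.

```python
import numpy as np

def orientation_histogram(patch, n_bins=8):
    """Toy descriptor: magnitude-weighted histogram of gradient directions, unit length."""
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)       # gradients along rows and columns
    magnitude = np.hypot(gx, gy)
    angle = np.arctan2(gy, gx)        # direction in [-pi, pi]
    hist, _ = np.histogram(angle, bins=n_bins, range=(-np.pi, np.pi),
                           weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

rng = np.random.RandomState(0)
patch = rng.rand(16, 16)
brighter = 2.0 * patch + 10.0         # same patch with more contrast and an offset

descriptor_original = orientation_histogram(patch)
descriptor_brighter = orientation_histogram(brighter)
print(np.allclose(descriptor_original, descriptor_brighter))
```

The added offset disappears when gradients are taken and the contrast factor disappears when the histogram is normalized, so the two descriptors agree even though the raw pixel values do not.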

#### Examples of SIFT features:
OpenCV (an open-source image processing library with many great features) is not compiled on the gcloud instance.  A working demo with SURF features is available instead:
[Link to Working Demo for SURF features](http://localhost:8891/notebooks/src/ImagesToSURF.ipynb)

# Building Our Own Image Classifier:
The code below does a fair amount of work to produce **X** and **y**.  Your job is to have fun with this code, try training other image classifiers, and make new predictions with your models.

In [None]:
import glob
import os
import shutil

%matplotlib inline
import matplotlib.pyplot as plt
import mahotas as mh
import numpy as np
import pandas as pd

from mahotas.features import surf
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.utils import shuffle

## Load Images and Calculate SURF Features (a Faster Alternative to SIFT)
Create a list of n arrays, each of shape (m_n by k), where n=num_images, m_n=num_features in image n, and k=feature_dimensionality.
<br> Each item in this list represents an image, and each row of an item is a 65-dimensional SURF feature.
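
The shape of that data structure can be sketched with made-up numbers. The arrays below are random stand-ins for real SURF output, and the per-image feature counts (3, 0 and 5) are arbitrary assumptions; only k=65 comes from the text above.

```python
import numpy as np

rng = np.random.RandomState(0)
k = 65                             # length of each feature vector
features_per_image = [3, 0, 5]     # m_n for three made-up images

# One (m_n, k) array per image; an image may have no features at all
surf_like = [rng.rand(m, k) for m in features_per_image]
for index, descriptors in enumerate(surf_like):
    print(index, descriptors.shape)

# Stacking all rows gives one big (sum of m_n, k) array, which is what gets clustered
stacked = np.concatenate(surf_like)
print(stacked.shape)
```

Because m_n differs from image to image, the list cannot be used directly as **X**; the clustering step later turns each image into a fixed-length vector.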

In [None]:
# Get images file names and targets:
def get_filenames_and_targets(folder1, folder2, truncate_after=100):
    # Data set 1 (target 1): keep at most truncate_after files
    set1_filenames = glob.glob(folder1+'*.jpg')[:truncate_after]
    set1_targets = [1]*len(set1_filenames)

    # Data set 2 (target 0): keep at most truncate_after files
    set2_filenames = glob.glob(folder2+'*.jpg')[:truncate_after]
    set2_targets = [0]*len(set2_filenames)

    # Concatenate everything together into one dataset:
    return (set1_filenames+set2_filenames), (set1_targets+set2_targets)
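
If you want to check this kind of file-listing logic without real images, one option is to run it on throwaway folders. The sketch below is an assumption-laden stand-in, not the workshop function: the helper `list_labelled_images` and the temporary folder layout are hypothetical.

```python
import glob
import os
import shutil
import tempfile

def list_labelled_images(folder, label, limit=100):
    """Return up to `limit` .jpg paths from `folder`, each paired with `label`."""
    filenames = sorted(glob.glob(os.path.join(folder, '*.jpg')))[:limit]
    return filenames, [label] * len(filenames)

# Build two temporary folders of empty .jpg files (stand-ins for real images)
root = tempfile.mkdtemp()
try:
    for name, count in [('dogs', 5), ('cats', 3)]:
        os.mkdir(os.path.join(root, name))
        for i in range(count):
            open(os.path.join(root, name, '%s.%d.jpg' % (name, i)), 'w').close()

    dog_files, dog_targets = list_labelled_images(os.path.join(root, 'dogs'), 1, limit=4)
    cat_files, cat_targets = list_labelled_images(os.path.join(root, 'cats'), 0, limit=4)
    filenames = dog_files + cat_files
    targets = dog_targets + cat_targets
finally:
    shutil.rmtree(root)   # clean up the temporary folders

print(len(filenames), targets)
```

Using `os.path.join` here means the folder argument does not need a trailing slash, which is a small design difference from the function above.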

# CHOOSE YOUR DATASETS
(Images should be organized into two different folders, *e.g.* '../data/dogs/' and '../data/cats/'.)

In [None]:
# Concatenate everything together into one dataset:
all_instance_filenames, all_instance_targets = get_filenames_and_targets('../data/dogs/', '../data/cats/')

### Generate X_train, y_train, X_test, and y_test:

In [None]:
surf_features = []
for idx, f in enumerate(all_instance_filenames):
    if idx % 100 == 0:
        print('.')  # progress marker every 100 images
    image = mh.imread(f, as_grey=True)
    # Keep columns 5 onward of the SURF output (65 values per feature):
    surf_features.append(surf.surf(image)[:,5:])
    # Above is a n X (m_n X k) list (n=num_images, m_n=num_features per image, k=feature_dimensionality)

train_len = int(len(all_instance_filenames)*.6)
surf_features, all_instance_targets = shuffle(surf_features, all_instance_targets)

X_train_surf_features = np.concatenate(surf_features[:train_len])
X_test_surf_features = np.concatenate(surf_features[train_len:])
y_train = all_instance_targets[:train_len]
y_test = all_instance_targets[train_len:]

# Cluster features to reduce dimensionality:
n_clusters=300
print('Clustering {} features'.format(len(X_train_surf_features)))
estimator = MiniBatchKMeans(n_clusters=n_clusters)
estimator.fit(X_train_surf_features) # Get 300 clusters of 65 dimensional SURF vectors
# from the 1,517,484 training instances of SURF vectors

X_train = []
for instance in surf_features[:train_len]:
    if instance.shape[0] > 0:
        # determine the cluster that each SURF feature belongs to, in each training instance
        clusters = estimator.predict(instance)
        # create histogram of the number of features corresponding to each of the 300 clusters
        features = np.bincount(clusters, minlength=n_clusters)
    else:
        # image with no SURF features: use an all-zero histogram
        features = np.zeros(n_clusters, dtype=int)
    X_train.append(features)

X_test = []
for instance in surf_features[train_len:]:
    if instance.shape[0] > 0:
        # determine the cluster that each SURF feature belongs to, in each test instance
        clusters = estimator.predict(instance)
        # create histogram of the number of features corresponding to each of the 300 clusters
        features = np.bincount(clusters, minlength=n_clusters)
    else:
        # image with no SURF features: use an all-zero histogram
        features = np.zeros(n_clusters, dtype=int)
    X_test.append(features)
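
The cell above follows what is often called a bag-of-visual-words approach: cluster all descriptors, then describe each image by how many of its descriptors fall into each cluster. Here is a small self-contained sketch of that idea on random stand-in descriptors. The tiny sizes (4 clusters, 3 images), the `random_state` and `n_init` values, and the helper name `to_histogram` are assumptions chosen to keep the example fast and repeatable.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.RandomState(0)
n_clusters = 4
k = 65

# Made-up descriptors for three "images"; the second image has none
descriptors = [rng.rand(30, k), np.empty((0, k)), rng.rand(12, k)]

# Learn the "visual words" from all descriptors stacked together
codebook = MiniBatchKMeans(n_clusters=n_clusters, random_state=0, n_init=3)
codebook.fit(np.concatenate(descriptors))

def to_histogram(image_descriptors):
    """Count how many descriptors of one image fall into each cluster."""
    if image_descriptors.shape[0] == 0:
        return np.zeros(n_clusters, dtype=int)
    words = codebook.predict(image_descriptors)
    # minlength pads the histogram so every image gets n_clusters bins
    return np.bincount(words, minlength=n_clusters)

X = np.array([to_histogram(d) for d in descriptors])
print(X)
```

Every row of `X` has the same length regardless of how many descriptors the image produced, which is what makes it usable as input to a classifier.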

# Use X_train and y_train to train a classifier:
How about Logistic Regression?

In [None]:
clf = LogisticRegression(C=0.001, penalty='l2')
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
print(classification_report(y_test, predictions))
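
If you want to try the train-and-evaluate steps without waiting for SURF features, the sketch below runs the same sequence on synthetic histogram-like counts. The data is invented (Poisson counts, with class 1 given extra counts in the first ten bins), so the scores it prints say nothing about real images.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

rng = np.random.RandomState(0)
n_per_class, n_features = 100, 20

# Made-up histograms: class 1 has extra counts in the first 10 bins
X0 = rng.poisson(lam=3.0, size=(n_per_class, n_features))
X1 = rng.poisson(lam=3.0, size=(n_per_class, n_features))
X1[:, :10] += rng.poisson(lam=2.0, size=(n_per_class, 10))
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

# Shuffle, then use a 60/40 train/test split as in the notebook
order = rng.permutation(len(y))
X, y = X[order], y[order]
train_len = int(len(y) * 0.6)

clf = LogisticRegression(C=0.001)   # small C means strong L2 regularization
clf.fit(X[:train_len], y[:train_len])
predictions = clf.predict(X[train_len:])
report = classification_report(y[train_len:], predictions)
print(report)
```

Changing `C` and rerunning is a quick way to see how the regularization strength affects precision and recall.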

# Prediction API:

In [None]:
def predict(clf, file_path):
    image = mh.imread(file_path, as_grey=True)
    X_SURF = surf.surf(image)[:,5:]
    # histogram of cluster assignments, padded to n_clusters bins
    X_new_prediction = np.bincount(estimator.predict(X_SURF), minlength=n_clusters)

    plt.imshow(image, cmap='gray')
    # sklearn expects a 2D array of shape (n_samples, n_features)
    return clf.predict(X_new_prediction.reshape(1, -1))

In [None]:
predict(clf, '../data/cats/cat.3.jpg')
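
One detail worth knowing when predicting for a single image: sklearn classifiers expect a 2D array with one row per sample, so a single feature vector has to be reshaped into one row. The sketch below shows this on random stand-in data; the feature count of 6 and the labelling rule are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_train = rng.rand(40, 6)
y_train = (X_train[:, 0] > 0.5).astype(int)   # made-up labelling rule
clf = LogisticRegression().fit(X_train, y_train)

single_histogram = rng.rand(6)                # shape (6,): one sample, 1D
as_row = single_histogram.reshape(1, -1)      # shape (1, 6): one row, 2D
label = clf.predict(as_row)[0]                # predict returns an array; take its only entry
print(as_row.shape, label)
```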

# Can you get better results with another model?
Hint: The sklearn API is standardized across models, so you can train other classifiers just as the logistic regression classifier was trained above.

(Hints Below)

# Find another dataset of images online. Can you train another cool image classifier?
1. Find images online (for example [here](https://www.google.it/?client=firefox-b-ab#q=labelled+image+datasets))
    1. If calculating SURF features takes too long for your image set, limit the number of pictures you train with.
1. Upload them to our class instance, placing all images of a single label in a specific folder under */data/[custom folder_name]*.
    1. If they're less than 25 MB, you can upload them directly with the jupyter interface.
    1. If the images are more than 25 MB, give me the link and I will download them to our class instance.
1. Point your binary classification model to train from */data/[custom folder_name]* and test the results.
1. Test your model on a picture of yourself or something else fun :-)
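
For the tip about limiting the number of pictures, one possible approach is to draw a random subset of file names before computing any features. The helper `limit_files` and the file names below are hypothetical, written only to illustrate the idea.

```python
import random

def limit_files(filenames, max_files, seed=0):
    """Return at most max_files names, chosen at random but repeatably."""
    filenames = list(filenames)
    if len(filenames) <= max_files:
        return filenames
    return random.Random(seed).sample(filenames, max_files)

# Made-up file names standing in for a large image folder
all_files = ['img_%03d.jpg' % i for i in range(500)]
subset = limit_files(all_files, 100)
print(len(subset))
```

Sampling at random, rather than taking the first files in the folder, avoids accidentally training only on images that happen to sort first.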

We'll see how you do at the end: who was most creative, and who trained the best model?

### Hints:

- `KNeighborsClassifier(3)`
- `SVC(kernel="linear", C=0.025)`
- `SVC(gamma=2, C=1)`
- `DecisionTreeClassifier(max_depth=5)`
- `RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)`
- `AdaBoostClassifier()`
- `GaussianNB()`

See the sklearn documentation [here](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) for details on what these models are and how to use them.
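
Because every model shares the same `fit`/`predict` interface, the hinted classifiers can be compared in a single loop. The sketch below does this on random stand-in data with a made-up labelling rule, so the printed accuracies are illustrative only; the `random_state` arguments are additions for repeatability and are not part of the hints.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)     # made-up labelling rule
X_train, X_test = X[:120], X[120:]
y_train, y_test = y[:120], y[120:]

classifiers = {
    'KNeighbors': KNeighborsClassifier(3),
    'Linear SVC': SVC(kernel="linear", C=0.025),
    'RBF SVC': SVC(gamma=2, C=1),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=0),
    'Random Forest': RandomForestClassifier(max_depth=5, n_estimators=10,
                                            max_features=1, random_state=0),
    'AdaBoost': AdaBoostClassifier(random_state=0),
    'Gaussian NB': GaussianNB(),
}

scores = {}
for name, model in classifiers.items():
    model.fit(X_train, y_train)               # same call for every model
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name in sorted(scores):
    print('%-14s %.2f' % (name, scores[name]))
```

To use this with the image data, replace the synthetic arrays with the `X_train`, `y_train`, `X_test`, and `y_test` built earlier in the notebook.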