# Applying Clustering to learn features in a Semi-Supervised Learning Model

---

_Here we will combine clustering with classification in a semi-supervised
learning problem. You will learn features by clustering unlabeled data and use the
learned features to build a supervised classifier._

## The Problem: Clustering to learn features
In this example we will build a semi-supervised learning system that can classify
images of cats and dogs.
We will cluster the descriptors extracted from all of
the images to learn features. We will then represent an image with a vector with one
element for each cluster. Each element will encode the number of descriptors extracted
from the image that were assigned to the cluster. This approach is sometimes called
the bag-of-features representation, as the collection of clusters is analogous to the
bag-of-words representation's vocabulary.
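
As a minimal sketch of this representation, assume a vocabulary of five clusters and a single image whose six descriptors have already been assigned to clusters. The cluster indices below are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical values: the cluster index assigned to each of one image's
# six descriptors, with a vocabulary of five clusters.
n_clusters = 5
descriptor_clusters = np.array([0, 2, 2, 4, 2, 0])

# One element per cluster; each element counts the descriptors assigned to it.
bag_of_features = np.bincount(descriptor_clusters, minlength=n_clusters)
print(bag_of_features)  # [2 0 3 0 1]
```

Each position in the resulting vector counts how many of the image's descriptors fell into that cluster, regardless of how many descriptors the image produced in total.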



## The Data Set
We will use 1,000 images of cats and 1,000
images of dogs from the training set for Kaggle's Dogs vs. Cats competition. The dataset
can be downloaded from https://www.kaggle.com/c/dogs-vs-cats/data. We
will label cats as the positive class and dogs as the negative class. Note that the images
have different dimensions; since our feature vectors do not represent pixels, we do not
need to resize the images to have the same dimensions. We will train using the first 60
percent of the images, and test on the remaining 40 percent.

In [1]:
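import glob

import numpy as np
import mahotas as mh
from mahotas.features import surf
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression
# Import only the metrics used below instead of a wildcard import.
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)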

A naïve approach to classifying images would be to use the intensities, or brightnesses, of all of the pixels as
explanatory variables. This approach produces high-dimensional, non-sparse feature vectors for
even small images. Furthermore, this approach
is sensitive to the image's illumination, scale, and orientation.
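
To make these two drawbacks concrete, the following sketch flattens a hypothetical 100 x 100 grayscale image into a pixel-intensity vector and then brightens the image by a constant amount. Random values stand in for real pixels, and the brightness offset of 40 is an arbitrary assumption.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical 100 x 100 grayscale image with intensities in [0, 255].
image = rng.randint(0, 256, size=(100, 100))

# Using every pixel as an explanatory variable gives one dimension per pixel.
pixel_features = image.ravel()
print(pixel_features.shape)  # (10000,)

# Simulate a change in illumination by brightening every pixel.
brighter = np.clip(image + 40, 0, 255)
changed = np.mean(brighter.ravel() != pixel_features)
print('Fraction of features changed by brightening:', changed)
```

The content of the image is unchanged, yet nearly every element of the feature vector differs after brightening, which is what makes raw intensities a fragile representation.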

Speeded-Up Robust Features (SURF) is a method for extracting features from an
image that is less sensitive to the scale, rotation, and illumination of the image. As a result, SURF descriptors can be matched more reliably across images of the same object that differ in these ways.

## Loading the images, converting them to grayscale, and extracting the SURF descriptors
SURF descriptors can be extracted more quickly than many similar
features, but extracting descriptors from 2,000 images is still computationally
expensive and can take several minutes on most computers.

In [None]:
all_instance_filenames = []
all_instance_targets = []
# Label cats as the positive class (1) and dogs as the negative class (0).
for f in glob.glob('datasets/train/*.jpg'):
    target = 1 if 'cat' in f else 0
    all_instance_filenames.append(f)
    all_instance_targets.append(target)

surf_features = []
for f in all_instance_filenames:
    print('Reading image:', f)
    # Load the image as grayscale, then drop the first five columns of
    # each row returned by surf.surf().
    image = mh.imread(f, as_grey=True)
    surf_features.append(surf.surf(image)[:, 5:])

# Train on the first 60 percent of the images and test on the rest.
train_len = int(len(all_instance_filenames) * .60)
X_train_surf_features = np.concatenate(surf_features[:train_len])
X_test_surf_features = np.concatenate(surf_features[train_len:])
y_train = all_instance_targets[:train_len]
y_test = all_instance_targets[train_len:]

('Reading image:', 'datasets/train\\cat.0.jpg')
('Reading image:', 'datasets/train\\cat.1.jpg')
('Reading image:', 'datasets/train\\cat.10.jpg')
('Reading image:', 'datasets/train\\cat.100.jpg')
('Reading image:', 'datasets/train\\cat.1000.jpg')
('Reading image:', 'datasets/train\\cat.10000.jpg')
('Reading image:', 'datasets/train\\cat.10001.jpg')
('Reading image:', 'datasets/train\\cat.10002.jpg')
('Reading image:', 'datasets/train\\cat.10003.jpg')
('Reading image:', 'datasets/train\\cat.10004.jpg')
('Reading image:', 'datasets/train\\cat.10005.jpg')
('Reading image:', 'datasets/train\\cat.10006.jpg')
('Reading image:', 'datasets/train\\cat.10007.jpg')
('Reading image:', 'datasets/train\\cat.10008.jpg')
('Reading image:', 'datasets/train\\cat.10009.jpg')
('Reading image:', 'datasets/train\\cat.1001.jpg')
('Reading image:', 'datasets/train\\cat.10010.jpg')
('Reading image:', 'datasets/train\\cat.10011.jpg')
('Reading image:', 'datasets/train\\cat.10012.jpg')
('Reading image:', 'dataset

## Grouping the extracted descriptors into 300 clusters
We use MiniBatchKMeans, a variation of K-Means that uses a random
sample of the instances in each iteration. As it computes the distances to the
centroids for only a sample of the instances in each iteration, MiniBatchKMeans
converges more quickly but its clusters' distortions may be greater. In practice,
the results are similar, and the trade-off is quite acceptable.
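
The trade-off can be explored with a small sketch on synthetic data. The blobs, cluster count, and batch size below are arbitrary choices for illustration, not values from this example. The `inertia_` attribute is scikit-learn's name for the sum of squared distances from the instances to their closest centroids, and it serves here as the measure of distortion.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for a matrix of descriptors (assumption, not real SURF data).
X, _ = make_blobs(n_samples=5000, centers=10, n_features=16, random_state=0)

# Full K-Means uses every instance in each iteration.
full = KMeans(n_clusters=10, n_init=3, random_state=0).fit(X)

# MiniBatchKMeans updates the centroids from a random sample per iteration.
mini = MiniBatchKMeans(n_clusters=10, batch_size=256, n_init=3,
                       random_state=0).fit(X)

print('KMeans inertia:         ', full.inertia_)
print('MiniBatchKMeans inertia:', mini.inertia_)
```

Comparing the two printed values on your own data is a quick way to check whether the loss in cluster quality is acceptable before committing to the faster estimator.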

In [None]:
n_clusters = 300
print('Clustering', len(X_train_surf_features), 'features')
estimator = MiniBatchKMeans(n_clusters=n_clusters)
# Only the fitted centroids are needed, so call fit() rather than fit_transform().
estimator.fit(X_train_surf_features)

## Constructing feature vectors for training and testing data
We find the
cluster associated with each of the extracted SURF descriptors, and count them using
NumPy's bincount() function. The following code will produce a 300-dimensional
feature vector for each instance:

In [None]:
X_train = []
for instance in surf_features[:train_len]:
    # Assign each descriptor to its nearest cluster.
    clusters = estimator.predict(instance)
    # Count descriptors per cluster; minlength pads the counts with zeros
    # so that every vector has exactly n_clusters elements.
    features = np.bincount(clusters, minlength=n_clusters)
    X_train.append(features)

X_test = []
for instance in surf_features[train_len:]:
    clusters = estimator.predict(instance)
    features = np.bincount(clusters, minlength=n_clusters)
    X_test.append(features)

## Training a logistic regression classifier on the feature vectors and targets

In [None]:
clf = LogisticRegression(C=0.001, penalty='l2')
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
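
Because the steps above need the image dataset and several minutes of descriptor extraction, the following sketch runs the same sequence (cluster, count, classify) on synthetic data. Everything about the data is an assumption made for illustration: the `fake_descriptors` helper, the 4-dimensional descriptors, the 8 clusters, and the class-dependent means are hypothetical and do not come from the cats and dogs images.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n_clusters = 8  # arbitrary small vocabulary for the sketch


def fake_descriptors(label):
    """Return a (n_descriptors, 4) array for one hypothetical image."""
    n = rng.randint(20, 40)
    # Purely synthetic: class 1 descriptors center on +1, class 0 on -1.
    center = 1.0 if label == 1 else -1.0
    return rng.normal(loc=center, scale=1.0, size=(n, 4))


labels = np.array([0, 1] * 50)
descriptors = [fake_descriptors(y) for y in labels]
split = 60  # first 60 percent for training

# Learn the vocabulary from the training descriptors only; no labels are used.
estimator = MiniBatchKMeans(n_clusters=n_clusters, n_init=3, random_state=0)
estimator.fit(np.concatenate(descriptors[:split]))


def to_histogram(instance):
    """Count how many of an image's descriptors fall into each cluster."""
    return np.bincount(estimator.predict(instance), minlength=n_clusters)


X = np.array([to_histogram(d) for d in descriptors])

# Train the supervised classifier on the learned count features.
clf = LogisticRegression(max_iter=1000).fit(X[:split], labels[:split])
accuracy = clf.score(X[split:], labels[split:])
print('Test accuracy on synthetic data:', accuracy)
```

The clustering step never sees the labels, which is what makes the overall system semi-supervised; only the final classifier is trained on the targets.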

## Evaluating our model in terms of Precision, Recall and Accuracy

In [None]:
print(classification_report(y_test, predictions))
print('Precision: ', precision_score(y_test, predictions))
print('Recall: ', recall_score(y_test, predictions))
print('Accuracy: ', accuracy_score(y_test, predictions))
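
As a reminder of what these three numbers measure, here is a small worked example computed by hand. The labels and predictions are hypothetical and are not results from this model.

```python
# Hypothetical labels: 1 = cat (positive class), 0 = dog (negative class).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cats predicted as cats
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # dogs predicted as cats
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cats predicted as dogs
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # dogs predicted as dogs

precision = tp / (tp + fp)         # fraction of predicted cats that are cats
recall = tp / (tp + fn)            # fraction of actual cats that were found
accuracy = (tp + tn) / len(pairs)  # fraction of all predictions that are correct

print(precision, recall, accuracy)  # 0.6 0.75 0.625
```

Precision and recall are reported for the positive class, so with cats labeled 1 they describe how well the classifier finds cats.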

## Summary
In this example we combined clustering with classification in a semi-supervised
learning problem. We learned features by clustering unlabeled data and used the
learned features to build a supervised classifier that can classify
images of cats and dogs.

This semi-supervised system has better precision and recall than a logistic regression
classifier that uses only the pixel intensities as features. Moreover, our feature
representations have only 300 dimensions; by contrast, representing even a small 100 x 100 pixel
image by its pixel intensities would require 10,000 dimensions and would be far more expensive to compute with.

___