In [1]:
import pandas as pd
from ivpy import attach,show,montage,histogram,scatter,compose
from ivpy.extract import extract
from ivpy.reduce import pca,tsne,umap
from ivpy.cluster import cluster

In [None]:
DIR = "/mnt/e/Tasks/similarity/Images/jpg/"
df = pd.read_csv("oxfordflower.csv")

In [None]:
df.filename = [DIR+item for item in df.filename]

In [None]:
attach(df,'filename')

# extract()

I included basic color features in oxfordflower.csv to show off the plotting functions, but we can also extract those features ourselves with `extract()`. All you need to pass to `extract()` is the image filepaths and a keyword telling ivpy which feature to extract. Currently, the options are 'brightness', 'saturation', 'hue', 'entropy', 'std', 'contrast', 'dissimilarity', 'homogeneity', 'ASM', 'energy', 'correlation', 'neural', 'tags', or 'dmax'. Since we've already looked at the HSV properties, let's check out the others.

In [None]:
df['entropy'] = extract('entropy')

In [None]:
montage(xcol='entropy',shape='circle',ascending=True)

There are better datasets for illustrating entropy, but the plot still shows the pattern: the low-entropy images in the middle are "simpler" or more "minimalist", while the high-entropy images at the outer edge are "noisier". I won't march through every feature, but entropy, standard deviation, contrast, dissimilarity, homogeneity, ASM, energy, and correlation are all texture properties derived from the gray-level co-occurrence matrix (GLCM). For more information, see the scikit-image example: https://scikit-image.org/docs/dev/auto_examples/features_detection/plot_glcm.html
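
To make the GLCM idea a bit more concrete, here is a minimal NumPy sketch. It is not ivpy's or scikit-image's implementation: the helpers `glcm` and `texture_props` are hypothetical names, only horizontally adjacent pixel pairs are counted, and the formulas are the usual textbook definitions of contrast, homogeneity, ASM, and energy.

```python
import numpy as np

def glcm(img, levels):
    """Normalized co-occurrence matrix for horizontally adjacent pixels."""
    m = np.zeros((levels, levels), dtype=float)
    left = img[:, :-1].ravel()    # gray level of each pixel
    right = img[:, 1:].ravel()    # gray level of its right-hand neighbor
    np.add.at(m, (left, right), 1)  # count each (left, right) pair
    return m / m.sum()

def texture_props(p):
    """A few standard texture statistics of a normalized GLCM."""
    i, j = np.indices(p.shape)
    asm = float((p ** 2).sum())
    return {
        "contrast": float((p * (i - j) ** 2).sum()),
        "homogeneity": float((p / (1.0 + (i - j) ** 2)).sum()),
        "ASM": asm,
        "energy": asm ** 0.5,
    }

# Two toy 4x4 images with 2 gray levels: a flat one and a checkerboard.
flat = np.zeros((4, 4), dtype=int)
checker = np.indices((4, 4)).sum(axis=0) % 2

flat_props = texture_props(glcm(flat, levels=2))
checker_props = texture_props(glcm(checker, levels=2))
print(flat_props)
print(checker_props)
```

The flat image has zero contrast and maximal homogeneity, while the checkerboard, where every neighbor pair differs, has higher contrast and lower homogeneity.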

## Neural net similarity

Neural nets are all the rage now, and a neural net measure of similarity can be a great starting point for exploring your image collections. Moreover, because the neural net extractor delivers a high-dimensional vector, rather than a single number, we can test out our dimension reduction and clustering algorithms. Ivpy's neural net vector is the output of the penultimate layer of ResNet50: https://arxiv.org/abs/1512.03385

In [None]:
X = extract('neural')

First, let's normalize the vector space using `norm` from `ivpy.extract`:

In [None]:
from ivpy.extract import norm

In [None]:
X = norm(X)
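
What `norm` does internally is not shown in this notebook. One common normalization before distance-based clustering is to scale each feature vector to unit length, and the sketch below illustrates that idea only. The helper `l2_normalize` is a hypothetical name, and treating it as equivalent to ivpy's `norm` would be an assumption.

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    """Scale each row to unit Euclidean length (assumed scheme, not ivpy's code)."""
    X = np.asarray(X, dtype=float)
    lengths = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(lengths, eps)  # eps guards against all-zero rows

demo = np.array([[3.0, 4.0],
                 [0.0, 2.0]])
unit = l2_normalize(demo)
print(unit)
```

After this kind of scaling, Euclidean distance between rows depends only on the angle between the vectors, not on their magnitudes.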

# cluster()

Let's start by clustering in the high-dimensional space. Since there are 17 flower names, let's set the number of clusters to 17. If we pass no keyword argument for 'method', the method will be k-means:

In [None]:
df['cluster_kmeans_17'] = cluster(X,k=17)
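
For intuition about what k-means is doing, here is a small NumPy sketch of Lloyd's algorithm. It is not ivpy's code: `farthest_first_init` and `lloyd_kmeans` are hypothetical helpers, and the deterministic farthest-first initialization is a simple stand-in for the randomized initializations that library implementations typically use.

```python
import numpy as np

def farthest_first_init(X, k):
    """Pick X[0], then repeatedly the point farthest from all chosen centers."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers)

def lloyd_kmeans(X, k, n_iter=50):
    """Alternate between assigning points and recomputing cluster means."""
    centers = farthest_first_init(X, k)
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New center = mean of its members; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy blobs stand in for the feature vectors.
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=0.0, scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(20, 2))
labels, centers = lloyd_kmeans(np.vstack([blob_a, blob_b]), k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only the grouping matters, which is why comparing against flower names needs a label-invariant score.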

In [None]:
montage(facetcol='cluster_kmeans_17')

Not too bad! Let's see how well this purely visual clustering accords with the actual flower names:

In [None]:
from sklearn.metrics import adjusted_rand_score as adjrand

In [None]:
# map each flower name to an integer id, in order of first appearance
d = {name: i for i, name in enumerate(df.flowername.unique())}

In [None]:
df['flower_number'] = [d[item] for item in df.flowername]

In [None]:
adjrand(df.flower_number,df.cluster_kmeans_17)
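
For reference, the adjusted Rand index can be written out from pair counts in a few lines of standard-library Python. This is a sketch of the usual pair-counting formula, not scikit-learn's implementation, and `adjusted_rand` is a hypothetical helper that assumes at least two items; in practice `adjusted_rand_score` is the one to use.

```python
from collections import Counter
from math import comb

def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index from pair counts (assumes len(labels_a) >= 2)."""
    n = len(labels_a)
    # Pairs of items grouped together in both labelings.
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs grouped together within each labeling on its own.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)  # agreement expected by chance
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:  # degenerate case, e.g. everything in one cluster
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)

# Same grouping with different label names, then a grouping that cuts across it.
same_up_to_renaming = adjusted_rand([0, 0, 1, 1], [5, 5, 9, 9])
crossed = adjusted_rand([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, crossed)
```

Renaming the clusters does not change the score, and a labeling that agrees less than chance would predict comes out negative.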

The adjusted Rand index measures the similarity of two clusterings. A score of 1.0 means a perfect match, and random labelings score close to zero. So, 0.47 is not too bad! This means that there is some flower name signal in the purely visual data, but it does not provide a perfect discriminator. For fun, we could zoom into a heterogeneous cluster to see what's going on:

In [None]:
for cluster_number in df.cluster_kmeans_17.unique():
    tmp = df.flowername[df.cluster_kmeans_17==cluster_number]
    n = len(tmp.unique())
    print(cluster_number,":",n)
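
The same per-cluster count can be had without an explicit loop via pandas' `groupby` and `nunique`. The toy DataFrame below uses made-up rows and a hypothetical `cluster` column purely for illustration; on the real data the grouping column would be `cluster_kmeans_17`.

```python
import pandas as pd

# Made-up rows standing in for the real DataFrame.
toy = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1, 2],
    "flowername": ["daffodil", "tulip", "daffodil", "iris", "iris", "crocus"],
})

# Number of distinct flower names inside each cluster.
names_per_cluster = toy.groupby("cluster")["flowername"].nunique()
print(names_per_cluster)
```

Sorting the resulting Series with `sort_values(ascending=False)` puts the most heterogeneous clusters first.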

Let's look at cluster 3. From the plot above, we can see it looks pretty visually unified, but there are 10 different flower names in there!

In [None]:
montage(pathcol=df.filename[df.cluster_kmeans_17==3],notecol=df.flowername[df.cluster_kmeans_17==3])

Ah ok. In addition to daffodils, there are yellow tulips, sunflowers, cowslips, buttercups, and more.

# reduce

Ok that was fun. But now let's just look at all the images on the same plotting canvas, by reducing the 2048 dimensions of the ResNet50 vector down to 2. We can actually look at 3 different algorithms, and can make a triptych of plots with a loop:

In [None]:
plotlist = []
for func in [pca,tsne,umap]:
    df[['x','y']] = func(X,n_components=2)
    plotlist.append(scatter('x','y',side=800,xbins=40,ybins=40,thumb=20))
compose(*plotlist,ncols=3,border=True)

The t-SNE plot (middle) looks pretty interesting. Let's take a closer look. We will remove the gridding:

In [None]:
df[['x','y']] = tsne(X)
scatter('x','y',side=4000,thumb=64)

Looks good! We have some pretty clear flower neighborhoods here, and maybe the very beginnings of a simple k-nearest-neighbors flower classifier?
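
As a rough sketch of that idea, a k-nearest-neighbors vote over 2D points takes only a few lines of NumPy. Everything here is illustrative: `knn_predict` is a hypothetical helper and the coordinates and labels are made up. Note also that scikit-learn's t-SNE does not place new, unseen points into an existing embedding, so a real classifier would more likely vote in the original feature space.

```python
from collections import Counter
import numpy as np

def knn_predict(train_xy, train_labels, query_xy, k=3):
    """Majority label among the k training points closest to the query."""
    dists = np.linalg.norm(train_xy - query_xy, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2D coordinates: one neighborhood per flower name.
train_xy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                     [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
train_labels = ["daffodil"] * 3 + ["iris"] * 3

guess = knn_predict(train_xy, train_labels, np.array([4.8, 5.0]))
print(guess)
```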

Just a note on the reducing functions: they all use the scikit-learn API, including keywords. Ivpy is really just a thin wrapper around scikit-learn's already concise API. Usually, these functions will default to 2 dimensions, but you must pass `n_components=2` to `pca()`; otherwise it keeps every component, which for this dataset is 1360.
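
To see why PCA's output width depends on `n_components`, here is a sketch of PCA via the singular value decomposition in NumPy. It is not ivpy's or scikit-learn's code; `pca_svd` is a hypothetical helper, and the random matrix stands in for the feature vectors.

```python
import numpy as np

def pca_svd(X, n_components=None):
    """Project centered data onto its principal axes via SVD."""
    Xc = X - X.mean(axis=0)                      # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * S                               # coordinates along each axis
    return scores if n_components is None else scores[:, :n_components]

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 10))    # 6 samples, 10 features

all_scores = pca_svd(X_demo)         # no limit: min(6, 10) = 6 columns
two_d = pca_svd(X_demo, n_components=2)
print(all_scores.shape, two_d.shape)
```

With no limit the result has min(n_samples, n_features) columns, so it cannot be assigned directly to two DataFrame columns such as `x` and `y`.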