# [Image Analyzer](#section1) 
<a id="section1"></a>
## Example using PySpark and Scikit-learn
* Follow the steps in <a href="https://github.com/ContinuumIO/image-analyzer/blob/master/run_helper.sh">run_helper.sh</a>, using the default HDFS file naming in <a href="https://github.com/ContinuumIO/image-analyzer/blob/master/config.yaml">config.yaml</a>
* Then run each IPython notebook cell below to view the results
* If the notebook examples do not work, check your config.yaml against the config shown at the bottom of the notebook
* This notebook uses the example faces image <a href="http://cswww.essex.ac.uk/mv/allfaces/faces94.html">data set of Dr. Libor Spacek</a>

In [1]:
%matplotlib inline
from __future__ import print_function, division
from pyspark import SparkConf
from pyspark import SparkContext 
from StringIO import StringIO
import matplotlib.pyplot as plt
import numpy as np
from pprint import pprint

conf = SparkConf()
conf.set('spark.executor.instances', 2)
# Pass conf to the context; otherwise the setting above is never applied
sc = SparkContext(conf=conf)


#### ['config' below shows the yaml config file loaded for this example](#config)
<a id="config"></a>

In [2]:
import yaml
# safe_load parses plain YAML without constructing arbitrary Python objects
with open('config.yaml') as f:
    config = yaml.safe_load(f)
config

{'actions': ['map_each_image', 'kmeans', 'find_similar'],
 'candidate_batch': 'c4',
 'candidate_has_mapped': False,
 'candidate_measures_spec': '/t4/candidates/c4/measures',
 'candidate_spec': '/fuzzy/*0.jpg',
 'example_data': '/imgs/',
 'fuzzy_example_data': '/fuzzy/',
 'in_memory_set_len': 8000000,
 'input_spec': '/imgs/*0.jpg',
 'kmeans_group_converge': None,
 'kmeans_output': {'cluster_to_flattened': True,
  'cluster_to_key': True,
  'cluster_to_phash': True,
  'cluster_to_ward': True,
  'flattened_to_cluster': True,
  'flattened_to_key': True,
  'flattened_to_phash': True,
  'key_to_cluster': True,
  'key_to_phash': True,
  'phash_to_cluster': True,
  'phash_to_flattened': True,
  'phash_to_key': True,
  'ward_to_cluster': True,
  'ward_to_key': True},
 'kmeans_sample': 2000,
 'maxIterations': 15,
 'max_iter_group': 3,
 'n_clusters': 12,
 'n_clusters_group': 8,
 'patch': {'max_patches': 4,
  'random_state': 0,
  'window_as_fraction': [0.5, 0.5]},
 'phash_bits': 256,
 'phash_chunk_

### [Example of each image's output](#map_each_image)
<a id="map_each_image"></a>
#### These measurements are done for training and candidate images
#### On each training or candidate image, the measurements may also be applied to patches
#### The relevant code is in <a href="https://github.com/ContinuumIO/image-analyzer/blob/master/map_each_image.py">map_each_image.py</a>. <i>(The function "example" in map_each_image.py can run these measurements locally on an image file.)</i>
* Kmeans centroids
* Histogram
* Perceptive hashes (abbrev.)
* Ward cluster hashes (abbrev.) (hashing the output of the function seen in <a href="http://scikit-learn.org/stable/auto_examples/cluster/plot_lena_ward_segmentation.html">this scikit demo</a>)
* Principal components
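
To make the idea of a hash that survives small image changes concrete, here is a minimal sketch of an average hash, one simple kind of perceptual hash. It is an illustration only and is not taken from map_each_image.py; the function names `average_hash` and `hamming_distance` and the tiny 2x2 pixel grids are assumptions made up for this example.

```python
def average_hash(pixels):
    """Return an integer hash of a 2-D grid of grayscale values.

    Each bit is 1 where the pixel is brighter than the grid mean.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / float(len(flat))
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits on which two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


# Hypothetical 2x2 grayscale "images" (not from the faces data set)
original = [[10, 200], [220, 30]]
brightened = [[20, 210], [230, 40]]   # same picture, uniformly brighter
different = [[200, 10], [30, 220]]    # bright and dark pixels swapped

print(average_hash(original) == average_hash(brightened))
print(hamming_distance(average_hash(original), average_hash(different)))
```

A uniform brightness shift leaves the hash unchanged because every pixel keeps its position relative to the mean, which is the property that makes this family of hashes useful for fuzzy matching.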

In [3]:
example = sc.pickleFile('hdfs:///t1/map_each_image/measures').take(1)[0]
print("Keys:", example[1].keys())
print("Centroids in 1 image flattened:", example[1]['cen'])
print("Histogram flattened:", example[1]['histo'])
print("Perceptive hash (abbrev.):", example[1]['phash'][:5])
print("Ward cluster hash (abbrev.):", example[1]['ward'][:5])
print('PCA factors and variance', example[1]['pca_fac'], example[1]['pca_var'])

Keys: ['cen', 'pca_var', 'meta', 'phash', 'histo', 'ward', 'pca_fac', 'id']
Centroids in 1 image flattened: [ 0.0980597   0.10067332  0.07502033  0.4832288   0.74329001  0.3068403
  0.39556718  0.33680019  0.34318587  0.69543546  0.5883016   0.61652535
  0.97057652  0.86680192  0.84002519  0.26330549  0.2297267   0.24091083
  0.89458674  0.70459908  0.66731501  0.28936416  0.39460137  0.14140551
  0.04144318  0.04623584  0.02837553  0.40374291  0.58303112  0.23085584
  0.53872192  0.4576503   0.44566652  0.16590469  0.16399746  0.13009502]
Histogram flattened: [ 0.02744549  0.03814275  0.05321374  0.16956136  0.45035073  0.50272733
  0.75232422  0.0308008   0.04285328  0.05870309  0.17326318  0.51095122
  0.69764537  0.76385945  0.01479674  0.02571036  0.03825854  0.13062458
  0.30416244  0.36708713  0.64770496]
Perceptive hash (abbrev.): [4590452597640986239, 5705755325505827644, 4316219442648471965, -4779138984269271173, 67322450502461386]
Ward cluster hash (abbrev.): (-3831614152369

### Based on the data above for each image, a kmeans algorithm is run for all training images
#### The kmeans algorithm also tracks the most common perceptive hashes and ward cluster hashes
#### <a href="https://github.com/ContinuumIO/image-analyzer/blob/master/image_mapper.py">image_mapper.py</a> has the iterative kmeans loop on all images
* This shows cluster to hash lookups
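
For readers unfamiliar with the iterative kmeans loop, the following is a minimal pure-Python sketch of the assign-then-update step. It is not the code in image_mapper.py, which runs on Spark; the helper names `nearest` and `kmeans_step` and the four sample points are hypothetical.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))


def kmeans_step(points, centroids):
    """Assign every point to its nearest centroid, then recompute the centroids."""
    groups = [[] for _ in centroids]
    for point in points:
        groups[nearest(point, centroids)].append(point)
    updated = []
    for centroid, group in zip(centroids, groups):
        if not group:
            updated.append(centroid)  # keep an empty cluster's centroid unchanged
            continue
        updated.append([sum(axis) / float(len(group)) for axis in zip(*group)])
    return updated


# Hypothetical 2-D feature vectors; real inputs would be the per-image measures
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
centroids = [[1.0, 1.0], [9.0, 9.0]]
for _ in range(15):  # 15 echoes 'maxIterations' in the config shown earlier
    centroids = kmeans_step(points, centroids)
print(centroids)
```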

In [4]:
print('Kmeans cluster to perceptive hash')
pprint(sc.pickleFile("hdfs:///t1/km/cluster_to_phash").take(2))

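print('Kmeans cluster to Ward cluster hash')
# Read cluster_to_ward here; the output recorded below came from reading
# cluster_to_phash a second time, so it repeats the perceptive hashes
pprint(sc.pickleFile("hdfs:///t1/km/cluster_to_ward").take(2))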

Kmeans cluster to perceptive hash
[(0, (4590452597640986239, 5705755325505827644)),
 (0, (5705755325505827644, 4316219442648471965))]
Kmeans cluster to Ward cluster hash
[(0, (4590452597640986239, 5705755325505827644)),
 (0, (5705755325505827644, 4316219442648471965))]


### [We also save the inverse mappings from hash to cluster](#mappings)
<a id="mappings"></a>
#### This provides several ways to search for images by hash or kmeans cluster
#### In <a href="https://github.com/ContinuumIO/image-analyzer/blob/master/config.yaml">config.yaml</a>, see the kmeans_output dictionary that controls which lookup tables are created
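
As a rough sketch of how a flag dictionary like kmeans_output could decide which lookup tables get built, the snippet below inverts a forward lookup only when its flag is set. The sample pairs, the string hash values, and the `tables` dictionary are hypothetical; the real tables are Spark RDDs saved to HDFS.

```python
# Hypothetical (cluster, hash) pairs, shaped like the cluster_to_phash records
cluster_to_hash = [(0, "hash_a"), (0, "hash_b"), (1, "hash_c")]

# A cut-down stand-in for the kmeans_output section of config.yaml
kmeans_output = {"cluster_to_phash": True, "phash_to_cluster": True}

tables = {}
if kmeans_output.get("cluster_to_phash"):
    tables["cluster_to_phash"] = list(cluster_to_hash)
if kmeans_output.get("phash_to_cluster"):
    # Swapping each pair turns the forward lookup into the inverse one
    tables["phash_to_cluster"] = [(h, c) for c, h in cluster_to_hash]

print(sorted(tables))
print(tables["phash_to_cluster"])
```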

In [5]:
print('Perceptive hash to kmeans cluster')
pprint(sc.pickleFile("hdfs:///t1/km/phash_to_cluster").take(2))

print("Ward cluster hash to kmeans cluster")
pprint(sc.pickleFile("hdfs:///t1/km/ward_to_cluster").take(2))

Perceptive hash to kmeans cluster
[((4590452597640986239, 5705755325505827644), 0),
 ((5705755325505827644, 4316219442648471965), 0)]
Ward cluster hash to kmeans cluster
[(-3831614152369362579, 0), (-3831614152369362579, 0)]


### [Hash counts in kmeans clusters](#hash_counts)
<a id="hash_counts"></a>
#### A dictionary for each kmeans cluster counts the most common N hashes per cluster
* This shows the top ward cluster hashes in kmeans cluster with index 0
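
The per-cluster counting described above can be sketched with `collections.Counter`. This is an assumption about the general approach, not the project's code; `top_hashes_per_cluster` and the sample pairs are hypothetical.

```python
from collections import Counter


def top_hashes_per_cluster(pairs, n):
    """Map each cluster to a dict of its n most common hashes and their counts."""
    counts = {}
    for cluster, hash_value in pairs:
        counts.setdefault(cluster, Counter())[hash_value] += 1
    return {cluster: dict(counter.most_common(n))
            for cluster, counter in counts.items()}


# Hypothetical (cluster, ward hash) pairs
pairs = [(0, -11), (0, -11), (0, -22), (0, -33), (0, -33), (0, -33), (1, -44)]
top = top_hashes_per_cluster(pairs, 2)
print(top[0])
```

Keeping only the top N hashes bounds the size of each cluster's dictionary, at the cost of dropping rare hashes.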

In [6]:
clust0_ward = sc.pickleFile('hdfs:///t1/km/ward_unions').take(1)[0]
pprint(dict(clust0_ward))

{-9095433121909742913: 1,
 -9001240377361059183: 1,
 -8968627659190537093: 1,
 -8917634196879128416: 1,
 -8890094862188566716: 1,
 -8784979837131872929: 4,
 -8576536287508771508: 1,
 -8517683562249245472: 1,
 -8377574415671456776: 1,
 -8265395038234503446: 1,
 -8238206338916862869: 6,
 -7999998493000983909: 1,
 -7864692876566671025: 1,
 -7827924191438386758: 1,
 -7506783778991226981: 1,
 -7434824337793658859: 1,
 -7406090092824014315: 1,
 -6921933845541658490: 1,
 -6855436430839899081: 1,
 -6775669275395669891: 1,
 -6696910335962296665: 6,
 -6573633938772194726: 1,
 -6506769527532925706: 1,
 -6479798263929289639: 3,
 -6439366262116039212: 9,
 -6439362504181703090: 1,
 -6371097860695599930: 4,
 -6249680222990146917: 1,
 -6224159247635402022: 1,
 -6019037222050955989: 2,
 -5806487960647788011: 1,
 -5552806537570066040: 1,
 -5541287140049078438: 1,
 -5460220511711024102: 1,
 -5423295101278253790: 2,
 -5329514752285494194: 1,
 -5207357987965578153: 1,
 -5194032257652069645: 1,
 -5130029528

### Using joins to finally get a matching image name
#### These examples are ward hash to image key and perceptive hash to image key mappings
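
The join itself can be pictured with plain Python lists. The sketch below mimics the shape of the result that `RDD.join` produces, `(key, (left_value, right_value))`, on small in-memory data; the `join` helper, the integer hashes, and the file names are hypothetical.

```python
from collections import defaultdict


def join(left, right):
    """Inner join of two lists of (key, value) pairs on the key."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, (left_value, right_value))
            for key, left_value in left
            for right_value in index.get(key, [])]


# Hypothetical lookup tables keyed by hash
hash_to_cluster = [(111, 0), (222, 0), (333, 1)]
hash_to_key = [(111, "/imgs/a.jpg"), (333, "/imgs/b.jpg")]

joined = join(hash_to_cluster, hash_to_key)
# Dropping the hash leaves cluster -> image name pairs
cluster_to_key = sorted(set(value for _, value in joined))
print(cluster_to_key)
```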

In [7]:
print('ward_to_key\n')
pprint(sc.pickleFile('hdfs:///t1/km/ward_to_key').take(2))
print('\n\nphash_to_key\n')
pprint(sc.pickleFile('hdfs:///t1/km/phash_to_key').take(2))

ward_to_key

[(-3831614152369362579,
  u'hdfs://ip-10-111-177-131:9000/imgs/femaleasammaasamma.2.jpg'),
 (-3831614152369362579,
  u'hdfs://ip-10-111-177-131:9000/imgs/femaleasammaasamma.2.jpg')]


phash_to_key

[((4590452597640986239, 5705755325505827644),
  u'hdfs://ip-10-111-177-131:9000/imgs/femaleasammaasamma.2.jpg'),
 ((5705755325505827644, 4316219442648471965),
  u'hdfs://ip-10-111-177-131:9000/imgs/femaleasammaasamma.2.jpg')]


### [Joins lead to a number of potentially matching images](#vote_count)
<a id="vote_count"></a>
* The example below counts how many ward hash chunks match each candidate
* Candidate images have paths under /fuzzy/
* The original images are in /imgs/
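
One way to picture the vote count: every hash chunk of the candidate that also occurs in an original image gives that original one vote. The sketch below uses plain dictionaries in place of the Spark tables; `count_votes`, the integer hashes, and the file names are hypothetical.

```python
from collections import Counter


def count_votes(candidate_hashes, hash_to_key):
    """Count, per original image, how many candidate hash chunks it shares."""
    votes = Counter()
    for hash_value in candidate_hashes:
        for key in hash_to_key.get(hash_value, ()):
            votes[key] += 1
    return votes


# Hypothetical lookup from hash chunk to the originals that contain it
hash_to_key = {
    101: ["/imgs/a.jpg"],
    102: ["/imgs/a.jpg", "/imgs/b.jpg"],
    103: ["/imgs/b.jpg"],
    104: ["/imgs/a.jpg"],
}
candidate_hashes = [101, 102, 104, 999]  # 999 matches nothing

votes = count_votes(candidate_hashes, hash_to_key)
best_key, best_count = votes.most_common(1)[0]
print(best_key, best_count)
```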

In [None]:
parts = (config['test_name'], config['candidate_batch'])
ward_matches = sc.pickleFile('hdfs:///%s/candidates/%s/ward_to_key_counts' % parts)
phash_matches = sc.pickleFile('hdfs:///%s/candidates/%s/phash_to_key_counts' % parts)


In [None]:
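# A full outer join keeps candidates that appear in only one of the two tables
joined = ward_matches.fullOuterJoin(phash_matches)

def best_votes(x):
    """Merge ward and phash vote counts for one candidate and pick the top match."""
    key, (ward, phash) = x
    # A side missing from the join is None; treat it as having no votes
    d1 = {} if ward is None else ward[1]
    d2 = {} if phash is None else dict(phash[1])  # copy so the input is not mutated
    for k in d1:
        d2[k] = d2.get(k, 0) + d1[k]
    # Original image with the most combined votes, as a (name, votes) pair
    d3 = max(d2.items(), key=lambda item: item[1])
    return key, (d3, d2)

phash_and_ward = joined.map(best_votes).collect()
print("Example votes dictionary:")
pprint(phash_and_ward[0])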

#### [Loading and comparing historical and matched images](#matches)
<a id="matches"></a>

In [None]:
def load_image(image):
    """Load one image, where image = (key, blob); return (key, uint8 array)."""
    from StringIO import StringIO
    from PIL import Image
    img = Image.open(StringIO(image[1]))
    return image[0], np.asarray(img, dtype=np.uint8)

for p in phash_and_ward:
    # p[0] is the candidate path; p[1][0][0] is the best-voted original path
    cname, candidate = load_image(sc.binaryFiles(p[0]).collect()[0])
    mname, matched = load_image(sc.binaryFiles(p[1][0][0]).collect()[0])
    print("Candidate (fuzzy)", cname)
    plt.subplot(1, 2, 1)
    plt.imshow(candidate)
    print("Matched (original)", mname)
    plt.subplot(1, 2, 2)
    plt.imshow(matched)
    plt.show()
