###### Clustering

Clustering is a branch of unsupervised machine learning in which we partition a
dataset into groups of similar items. ‘Unsupervised learning’ means that no labeled
training takes place: the dataset is unlabeled and the algorithm has to find the
structure on its own. Clustering can be done using different techniques such as
K-means clustering, Mean Shift clustering, DBSCAN clustering and hierarchical clustering.
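
To make the idea concrete, here is a minimal sketch of plain K-means (Lloyd's algorithm) in NumPy. It is only an illustration: the function name `lloyd_kmeans`, the random initialisation and the toy points are assumptions made for this example, and scikit-learn's `KMeans` (used later in this notebook) defaults to the more careful `k-means++` initialisation.

```python
import numpy as np

def lloyd_kmeans(points, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (fancy indexing copies them).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

# Toy data: two well-separated groups of 2-D points (made up for illustration).
toy_points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                       [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_labels, toy_centroids = lloyd_kmeans(toy_points, k=2)
print(toy_labels)
```

The first three points end up in one cluster and the last three in the other. Which cluster gets the id 0 and which gets 1 depends on the random start, which is why cluster ids carry no meaning by themselves.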

###### Image clustering


Image clustering is an essential data analysis tool in machine
learning and computer vision. Many applications
such as content-based image annotation and
image retrieval can be viewed as different instances
of image clustering. Technically, image clustering
is the process of grouping images into clusters such that the
images within the same clusters are similar to each other,
while those in different clusters are dissimilar.

In [54]:
import os
# Code: import Kmeans library from sklearn ( 1 point)
from sklearn.cluster import KMeans


###### VGG 

VGG is a convolutional neural network model for image recognition proposed by the Visual Geometry Group at the University of Oxford. VGG16 refers to a VGG model with 16 weight layers, and VGG19 refers to a VGG model with 19 weight layers. In the VGG16 architecture, the input layer takes an image of size 224 x 224 x 3, and the output layer is a softmax prediction over 1000 classes. The part from the input layer to the last max pooling layer (labeled 7 x 7 x 512) is regarded as the feature extraction part of the model. Loading the model with `include_top=False` keeps only this feature extraction part and drops the fully connected classifier on top.
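
The 7 x 7 x 512 figure can be checked with a little arithmetic. The sketch below assumes the standard VGG16 layout: five 2 x 2 max pooling layers with stride 2, convolutions that keep the spatial size, and 512 channels in the last convolutional block. The variable names are chosen for this example only.

```python
# VGG16 layout facts used here: five 2x2 max pooling layers with stride 2,
# 'same'-padded 3x3 convolutions (which keep the spatial size), and 512
# channels in the last convolutional block.
input_size = 224
num_pool_layers = 5
last_block_channels = 512

spatial = input_size
for _ in range(num_pool_layers):
    spatial //= 2  # each pooling layer halves height and width

flattened_length = spatial * spatial * last_block_channels
print(spatial, flattened_length)  # 7 25088
```

The flattened length of 25088 is the size of the feature vector that each image contributes to the clustering step.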

In [64]:
from keras.preprocessing import image
from keras.applications.vgg16 import VGG16
from keras.applications.vgg16 import preprocess_input
import numpy as np

model = VGG16(weights='imagenet', include_top=False)    

# Code: Specify path of the random image from the training dataset. (1 point)
img_path = '/Users/pramanikpramanik/Desktop/Machine Learning/dataset/train_dataset/african-lionadapt19001JPG.jpg'
img = image.load_img(img_path, target_size=(224, 224))
img_data = image.img_to_array(img)
img_data = np.expand_dims(img_data, axis=0)  # add the batch dimension: (1, 224, 224, 3)
img_data = preprocess_input(img_data)  # same preprocessing as in extract_feature below

vgg16_feature = model.predict(img_data)


print('dimensions: ',np.ndim(vgg16_feature))
print('shape: ',np.shape(vgg16_feature))

# Code: print the shape of the vgg16_feature  (1 point)

dimensions:  4
shape:  (1, 7, 7, 512)


In [65]:
# The given function will extract the features from the images.

# 2 changes - one for skipping unreadable files and handling exceptions, another for saving filenames and returning them with the array.

def extract_feature(directory):
    vgg16_feature_list = []
    filename_arr = []  # filenames, kept in the same order as the feature rows
    for filename in os.listdir(directory):
        if filename.endswith('.DS_Store'):  # skip macOS folder metadata
            continue
        try:
            img = image.load_img(os.path.join(directory, filename), target_size=(224, 224))
            img_data = image.img_to_array(img)
            img_data = np.expand_dims(img_data, axis=0)
            img_data = preprocess_input(img_data)

            vgg16_feature = model.predict(img_data)
        except Exception as err:
            # Skip files that cannot be processed, but report them instead of failing silently.
            print('skipping', filename, ':', err)
            continue
        # Append feature and filename together so both arrays stay aligned;
        # the filenames are used later to copy images to their cluster folders.
        vgg16_feature_list.append(np.array(vgg16_feature).flatten())
        filename_arr.append(filename)
    vgg16_feature_list_np = np.array(vgg16_feature_list)
    filename_arr_np = np.array(filename_arr)

    return vgg16_feature_list_np, filename_arr_np  # returning filenames array as well

The given dataset has three classes: Lion, Fish and Zebra. We do not provide any
supervision to the model, i.e. we do not specify which image is associated with which
class / cluster. Instead, we use unsupervised image clustering to create the clusters.

In [66]:
train_feature_vector, f_train = extract_feature("/Users/pramanikpramanik/Desktop/Machine Learning/dataset/train_dataset")  
# pass the path of the folder where you have the training dataset

kmeans_model = KMeans(n_clusters=3, random_state=0)
# Code: create the kmeans object and initialize it with the number_of_clusters = 3   (2 point)
kmeans_model.fit(train_feature_vector)

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [67]:
# create a test vector using the extract_feature function. It will return a feature matrix of shape
# (number of images, size of the feature vector)

# returning an extra array for the file names
test_vector, f_test = extract_feature('/Users/pramanikpramanik/Desktop/Machine Learning/dataset/test_dataset')  # (1 point)

In [68]:
# Code: print the shape of the test vector   # (1 point)
print('dimensions: ',np.ndim(test_vector))
print('shape: ',np.shape(test_vector))

dimensions:  2
shape:  (32, 25088)
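
The two-dimensional shape comes from flattening each 4-D feature map and stacking the results row by row. Here is a small sketch of just that reshaping step, using zero-filled placeholder arrays instead of real VGG16 outputs (the name `fake_feature_maps` is made up for this example):

```python
import numpy as np

# Stand-ins for model.predict(...) outputs: zeros with the shape (1, 7, 7, 512).
fake_feature_maps = [np.zeros((1, 7, 7, 512), dtype=np.float32) for _ in range(32)]

# Flatten each feature map to a 1-D vector, then stack the vectors into a matrix.
flat_vectors = [feature_map.flatten() for feature_map in fake_feature_maps]
feature_matrix = np.array(flat_vectors)

print(flat_vectors[0].shape)  # (25088,)
print(feature_matrix.shape)   # (32, 25088)
```

Each row of the matrix is one image, which is the layout `KMeans` expects for `fit` and `predict`.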


In [52]:
labels = kmeans_model.predict(test_vector)
print(labels)
# Code: use the kmeans model to predict the labels for the test vector (1 point)

[2 2 1 2 0 0 0 0 0 0 2 2 2 0 2 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 1 1]
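
A quick way to read this output is to count how many test images fall into each cluster. The sketch below copies the printed labels into a plain list (the name `predicted` is chosen for this example) and counts them with `collections.Counter`:

```python
from collections import Counter

# Labels copied from the output above.
predicted = [2, 2, 1, 2, 0, 0, 0, 0, 0, 0, 2, 2, 2, 0, 2, 0,
             0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 1, 0, 1, 1]

cluster_sizes = Counter(predicted)
for cluster_id in sorted(cluster_sizes):
    print(cluster_id, cluster_sizes[cluster_id])
# 0 20
# 1 4
# 2 8
```

The cluster ids are arbitrary, so the counts alone do not say which animal each cluster contains; that has to be checked by looking at the images.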


In [53]:
# Code: Using the labels and the images, save the test images in the different folders in respective 
# clusters.   (2 point)

# extra code was added while generating features to save the filenames referred to here.

# folders have been named 0/1/2 because we do not know beforehand which cluster is zebra/fish/lion
# after copying and inspecting the folders - 0 is fish, 1 is zebra, 2 is lion

from shutil import copyfile
testhome = '/Users/pramanikpramanik/Desktop/Machine Learning/dataset/'
for filename, label in zip(f_test, labels):
    cluster_dir = os.path.join(testhome, 'result', str(label))
    os.makedirs(cluster_dir, exist_ok=True)  # create the cluster folder if it is missing
    copyfile(os.path.join(testhome, 'test_dataset', str(filename)), os.path.join(cluster_dir, str(filename)))
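
The same copy step can be tried out without touching the real dataset. The following sketch wraps it in a hypothetical helper, `copy_into_cluster_folders`, and runs it on throwaway files in a temporary directory; the file names and labels in the demo are made up.

```python
import os
import shutil
import tempfile

def copy_into_cluster_folders(src_dir, dst_dir, filenames, labels):
    """Copy each file into dst_dir/<label>/, creating folders as needed."""
    for name, label in zip(filenames, labels):
        cluster_dir = os.path.join(dst_dir, str(label))
        os.makedirs(cluster_dir, exist_ok=True)
        shutil.copyfile(os.path.join(src_dir, name), os.path.join(cluster_dir, name))

# Demo on throwaway files; the names and labels below are made up.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "test_dataset")
    dst = os.path.join(tmp, "result")
    os.makedirs(src)
    demo_files = ["a.jpg", "b.jpg", "c.jpg"]
    demo_labels = [0, 2, 0]
    for name in demo_files:
        with open(os.path.join(src, name), "wb") as fh:
            fh.write(b"placeholder")
    copy_into_cluster_folders(src, dst, demo_files, demo_labels)
    # Record the resulting folder layout before the temporary directory is removed.
    result_layout = {d: sorted(os.listdir(os.path.join(dst, d)))
                     for d in sorted(os.listdir(dst))}

print(result_layout)  # {'0': ['a.jpg', 'c.jpg'], '2': ['b.jpg']}
```

Only folders for labels that actually occur are created, so a cluster with no test images simply has no folder.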