# Introduction


This project belongs to the field of unsupervised learning, and more specifically to deep clustering: it groups unannotated images with classic clustering algorithms such as k-means and Birch, and relies on transfer learning to reduce the training effort.
In this notebook, we walk through the stages in order: exploring the data, extracting features, clustering those features, and fine-tuning our model, which is based on the pre-trained **ResNet-50** network.

# Manipulating the Dataset

In this part, we explore the public dataset **colorectal_histology**, which contains textures from colorectal cancer histology. Each example is a 150 x 150 x 3 RGB image belonging to one of 8 classes.

We first print the dataset information and then visualize a few samples.

The dataset can be changed by modifying the next cell, where you can build or import other data.

In [None]:
# import and load "colorectal_histology" Dataset :
import tensorflow_datasets as tfds
ds, info = tfds.load('colorectal_histology', data_dir = '/input', split='train', shuffle_files=True, with_info=True)

In [None]:
# Get dataset information
print(info)

In [None]:
# Show samples of dataset -- discovering
examples = tfds.show_examples(ds, info, rows=5, cols=5)

# Functions, libraries and routines

In this section, we present the packages, modules and functions used to implement our deep clustering model.

In [None]:
# Import necessary modules and packages:
## import Architecture "ResNet-50"
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
import numpy as np
##
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
##
import keras
from keras.models import Model
##
from tensorflow.keras.preprocessing.image import ImageDataGenerator 
from tensorflow.keras import optimizers
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, InputLayer
from keras.models import Sequential
#
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, Birch
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics.cluster import normalized_mutual_info_score
#
from sklearn.manifold import TSNE
import pandas as pd
import seaborn as sn
#
from google.colab import files
#
from keras.optimizers import SGD
#
from keras import utils
from tensorflow.keras.initializers import RandomNormal
from keras.models import load_model
from keras.utils import to_categorical
import random

In [None]:
def clustering(features, clusters, variance = 0.98, alg = "kmeans"):
  '''
  Cluster the features into a given number of groups using one of two
  clustering algorithms: kmeans or birch.
  For speed and efficiency, the features are first standardized and then
  reduced with PCA, keeping only as many principal components as needed
  to explain the requested fraction of variance (float < 1).

  Parameters:
    features (array): array containing the features
    clusters (int): number of clusters to create
    variance (float): fraction of variance kept by PCA
    alg (str): "kmeans" runs the kmeans algorithm; any other value runs birch.

  Returns:
    new_labels (array): predicted cluster labels.
  '''
  # Standardize the features, then reduce their dimensionality with PCA.
  cl = StandardScaler().fit_transform(features)
  pca = PCA(variance)  # keep enough components to explain `variance`
  cl = pca.fit_transform(cl)
  if alg =="kmeans" :
    kmeans = KMeans(init = "k-means++", n_clusters = clusters, n_init =25)
    kmeans.fit(cl)
    new_labels = kmeans.labels_ 
  else :
    brc = Birch(n_clusters=clusters).fit(cl)
    new_labels= brc.labels_
  return new_labels


def TSNE_plot(X,labels,ckpt, save=True):
  '''
  t-SNE plot to visualize the clusters according to the labels associated with the features.
  This function plots the result and can optionally save a snapshot of it.

  Parameters:
    X (array): array containing features
    labels (array): array containing labels
    ckpt (int): checkpoint number, used as the file name of the saved image
    save (bool): whether to save and download the image

  Returns:
    None
    -> Displays the figure of the clusters
  '''
  X = StandardScaler().fit_transform(X)
  tsne_data = TSNE(n_components=2, random_state=0, perplexity=50.0).fit_transform(X)
  tsne_data = np.vstack((tsne_data.T, labels)).T
  tsne_df = pd.DataFrame(data=tsne_data, columns=("Dim_1", "Dim_2", "label"))
  # `height` is the name used by recent seaborn releases (formerly `size`).
  sn.FacetGrid(tsne_df, hue="label", height=6).map(plt.scatter, 'Dim_1', 'Dim_2').add_legend()
  if save:
    plt.savefig("%d.png" % ckpt)
    files.download("%d.png" % ckpt)  # only available on Google Colab
  plt.show()


def most_frequent(List): 
  '''
  Return the most frequent element of a list.

  Parameters:
    List (list): list to analyze

  Returns:
    num (type of the elements in List): most frequent element.
  '''
  counter = 0
  num = List[0]

  for i in List:
    curr_frequency = List.count(i)
    # Strict comparison: on a tie, the element seen first is kept.
    if curr_frequency > counter:
      counter = curr_frequency
      num = i

  return num


def samples(id_cluster, klabels, labels, images, pp):
  '''
  Visualize some samples of a predicted cluster and show their true class.

  Parameters:
    id_cluster (int): id of the predicted cluster to display
    klabels (array): predicted labels
    labels (array): true labels, one-hot encoded
    images : images of the dataset
    pp (int): side of the display grid (at most pp*pp samples are shown)
  Returns:
    None
    -> show samples
  '''
  cluster_members = (klabels == id_cluster)
  fig = plt.figure(figsize=(10,10))
  m = 0
  p = 1
  # Stop once the grid is full or the dataset is exhausted.
  while p <= pp*pp and m < len(cluster_members):
    if cluster_members[m]:
      img = images[m]
      label = list(labels[m]).index(1)
      fig.add_subplot(pp, pp, p)
      plt.imshow(img)
      plt.title(str(label))
      p += 1
    m += 1
  plt.show()

def map(a, b, real_labels, label_map, NUM_CLUSTER):
  '''
  Translate a slice of the real labels into the matching cluster ids, using the
  correspondence built by dict_init (assumed to be a bijection between cluster ids and real labels).

  Parameters:
    a (int): start index
    b (int): end index
    real_labels (array): array containing the real labels
    label_map (dict): cluster id -> real label, as returned by dict_init
    NUM_CLUSTER (int): number of clusters (currently unused)
  Returns:
    mapped (list): cluster id matching each real label in real_labels[a:b].
  '''
  mapped = []
  for elm in real_labels[a:b]:
    for i in label_map.keys():
      if label_map[i] == elm:
        mapped.append(i)
  return mapped
  

def dict_init(real_labels, new_labels, NUM_CLUSTER):
  '''
  Find, for each predicted cluster, the real label it corresponds to.

  Parameters:
    real_labels (array): array containing the original labels
    new_labels (array): array containing the predicted labels
    NUM_CLUSTER (int): number of clusters

  Returns:
    dict (dict): maps each cluster id to its matching real label.
  '''
  dict = {}
  used = []
  for i in range(NUM_CLUSTER) :
    members = (new_labels == i)
    clus = [ real_labels[i] for i in range(len(new_labels)) if members [i]]
    ks = dom(clus)
    j = 0
    while ks[j] in used:
      if j < len(ks) -1 :
        j+=1
      else :
        l = [i for i in range(NUM_CLUSTER) if not(i in used)]
        val= list(dict.values()).index(ks[j])
        dict[val] = l[0]
        used.append(l[0])
        used.remove(ks[j])
    dict[i] = ks[j]
    used.append(ks[j])
  return dict  


def dom(liste): 
  '''
  List the distinct elements of a list by decreasing frequency.

  Parameters:
    liste (list): list to analyze

  Returns:
    res (list): distinct elements, most frequent first.
  '''
  res = []
  l = liste
  while len(l)>=1 :
    m = most_frequent(l)
    res.append(m)
    l = list(filter(lambda x: x != m, l))
  return res
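
The matching between cluster ids and real labels performed by the helpers above can be illustrated on a toy example. The following is a simplified, dependency-free sketch and not a drop-in replacement for `dict_init`: the names `rank_by_frequency` and `match_clusters_to_labels` are hypothetical, labels are assumed to be integers in `range(num_clusters)`, and the fallback rule for a cluster without a free label is an assumption.

```python
from collections import Counter

def rank_by_frequency(values):
    """Return the distinct values, most frequent first (ties keep first-seen order)."""
    return [value for value, _ in Counter(values).most_common()]

def match_clusters_to_labels(real_labels, cluster_ids, num_clusters):
    """Greedy one-to-one matching: each cluster takes its most frequent unused label."""
    mapping, used = {}, set()
    for cluster in range(num_clusters):
        members = [lab for lab, cid in zip(real_labels, cluster_ids) if cid == cluster]
        ranked = rank_by_frequency(members)
        choice = next((lab for lab in ranked if lab not in used), None)
        if choice is None:
            # Assumed fallback: take the smallest label that is still free.
            choice = min(set(range(num_clusters)) - used)
        mapping[cluster] = choice
        used.add(choice)
    return mapping

real = [0, 0, 0, 1, 1, 2, 2, 2]
pred = [2, 2, 2, 0, 0, 1, 1, 0]
print(rank_by_frequency([3, 1, 3, 2, 1, 3]))     # [3, 1, 2]
print(match_clusters_to_labels(real, pred, 3))   # {0: 1, 1: 2, 2: 0}
```

Because cluster ids are arbitrary, such a matching is needed before a confusion matrix or an accuracy can be read meaningfully.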

# Image Pre-processing

At this point, we need to generate tensors. This can be done with ImageDataGenerator, which generates batches of tensor image data with real-time pre-processing. Note that we feed the original images to the model: we only scale the pixel values to the range 0 to 1 and do not apply any other transformation.



In [None]:
# absolute path for our dataset : 
data_path = '/input/downloads/extracted/ZIP.zeno.org_reco_5316_file_Kath_text_2016_imaqL7TPMR0wf27knUqk31h7Z3Aye3ukvUAeDFu7zhZbcQ.zip/Kather_texture_2016_image_tiles_5000'

# Rescaling only (no data augmentation):
# we scale the image pixels to the range [0, 1].
train_datagen = ImageDataGenerator(rescale=1./255)
# Then we generate batches of image tensors.
train_generator = train_datagen.flow_from_directory(
        data_path,
        target_size=(150, 150),
        batch_size= 20,
        class_mode= 'categorical')
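
With `rescale=1./255` and `class_mode='categorical'`, each batch contains pixel values scaled to the range 0 to 1 and one-hot encoded labels. The sketch below shows both conventions on plain Python lists, without any dependency; the helper names `rescale`, `to_one_hot` and `from_one_hot` are hypothetical and only serve the illustration.

```python
def rescale(pixels, factor=1.0 / 255):
    """Scale raw 8-bit pixel values to the range [0, 1]."""
    return [value * factor for value in pixels]

def to_one_hot(class_ids, num_classes):
    """Encode integer class ids as one-hot rows."""
    return [[1.0 if k == cid else 0.0 for k in range(num_classes)]
            for cid in class_ids]

def from_one_hot(rows):
    """Recover integer class ids: index of the largest entry in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

ids = [2, 0, 1]
encoded = to_one_hot(ids, 3)
print(rescale([0, 128, 255]))
print(encoded)
print(from_one_hot(encoded))   # [2, 0, 1]
```

The last step mirrors what `np.argmax(labels, axis=-1)` does later in the notebook to recover integer class ids from the one-hot labels.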

# Deep Cluster : **Iteration 0**

In this section, we implement the first iteration, named "iteration 0". It sets up the main program that builds our deep clustering model.

This iteration creates a model based on the neural network **ResNet-50**, pre-trained on ImageNet; the dense classification layer added on top of it is initialized with random weights.
The model is fine-tuned in the next section, where we improve our neural network.

In [None]:
# pre trained model "ResNet-50"
resnet = ResNet50(include_top=False, input_shape=(150,150,3), weights='imagenet', pooling='avg')
resnet.summary()

In [None]:
# Extracting features with RESNET-50  pre-trained model :

sample_count = 5000
batch_size = 20
features = np.zeros(shape=(sample_count, 2048))
images = np.zeros(shape=(sample_count, 150, 150, 3))
labels = np.zeros(shape=(sample_count, 8))
i=0
for inputs, lab in train_generator:
    images[i*batch_size:(i+1)*batch_size] = inputs
    labels[i*batch_size:(i+1)*batch_size] = lab
    features[i*batch_size:(i+1)*batch_size]= resnet.predict(inputs)
    i+=1
    if i*batch_size >= sample_count : 
      break



In [None]:
# Get predicted labels from new extracted features:
new_labels = clustering(features, 8)
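
Inside `clustering`, PCA is given a fraction of variance rather than a fixed number of components. The idea can be sketched as follows: keep the smallest number of leading components whose cumulative explained-variance ratio reaches the target. This is an illustration of the principle, not scikit-learn's implementation; the function name `components_for_variance` and the ratios are made up.

```python
def components_for_variance(explained_ratios, target):
    """Smallest number of leading components whose cumulative ratio reaches `target`."""
    cumulative = 0.0
    for count, ratio in enumerate(explained_ratios, start=1):
        cumulative += ratio
        if cumulative >= target:
            return count
    return len(explained_ratios)

# Made-up explained-variance ratios, sorted in decreasing order.
ratios = [0.60, 0.25, 0.10, 0.04, 0.01]
print(components_for_variance(ratios, 0.98))   # 4
print(components_for_variance(ratios, 0.80))   # 2
```

A higher target keeps more components, which preserves more information but makes the clustering step slower.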

In [None]:
# Visualization : TSNE plot used to see predicted clusters
TSNE_plot(features,new_labels, 0)
# Note: as expected at iteration 0, the clusters look random and unordered.

In [None]:
# Adding the Dense Classifier of the Cluster Assignments

x = resnet.output
x = Flatten(name='flatten')(x)
#x = Dense(128,activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.001))(x)
x = Dropout(0.5)(x)
x = Dense(8, activation='softmax', name='fc8', kernel_initializer=RandomNormal(mean=0.0, stddev=0.001))(x)

net = Model(inputs=resnet.input, outputs=x)

net.trainable = False
net.save('checkpoint/0.ckpt') 

# Deep Cluster : **Main algorithm**

This section is about fine-tuning our neural network.
We build a dataset of (image, cluster assignment) pairs to fine-tune the model, on top of which we add dense layers that perform the classification. So that the fully connected layers on top behave well during fine-tuning, we generate their initial weights with a kernel initializer whose distribution we choose (a normal distribution with zero mean and a low standard deviation), and we train these layers together with the rest of the network using the SGD optimizer with a low learning rate.
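
To make these two choices concrete, here is a minimal, dependency-free sketch of a small-variance normal initialization and of one SGD step with momentum. The update rule (velocity = momentum * velocity - learning_rate * gradient, then weight += velocity) is the usual formulation of momentum SGD; the toy objective f(w) = w**2, the function name `sgd_momentum_step` and the number of steps are assumptions made for the illustration.

```python
import random

def sgd_momentum_step(weight, velocity, grad, learning_rate=0.001, momentum=0.9):
    """One update: the velocity accumulates past gradients, then moves the weight."""
    velocity = momentum * velocity - learning_rate * grad
    return weight + velocity, velocity

# Initial weights drawn from a normal distribution with mean 0 and a small
# standard deviation, similar in spirit to RandomNormal(mean=0.0, stddev=0.001).
random.seed(0)
initial_weights = [random.gauss(0.0, 0.001) for _ in range(8)]
print(initial_weights)

# Toy objective f(w) = w**2, whose gradient is 2*w (assumption for illustration).
weight, velocity = 1.0, 0.0
for _ in range(2000):
    weight, velocity = sgd_momentum_step(weight, velocity, grad=2.0 * weight)
print(abs(weight) < 1e-3)
```

A low learning rate keeps each step small, so the pre-trained weights are adjusted gradually rather than overwritten.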

The stopping criterion of this algorithm is purely experimental: it is based on the t-SNE plot, the confusion matrix and the accuracies computed from this matrix. In practice, it is reached after about 10 to 15 iterations.
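
Alongside these checks, the training cell records the normalized mutual information (NMI) between the cluster assignments of two consecutive iterations. A value close to 1 suggests that the assignments have stopped changing, whatever the numbering of the clusters. The sketch below computes this quantity in plain Python, assuming the mutual information is normalized by the arithmetic mean of the two entropies; the function names are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    """Mutual information (in nats) between two assignments of the same samples."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in joint.items())

def nmi(a, b):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    mean_entropy = (entropy(a) + entropy(b)) / 2
    return 1.0 if mean_entropy == 0 else mutual_information(a, b) / mean_entropy

previous = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]           # same partition, different cluster ids
print(nmi(previous, renamed))          # close to 1
print(nmi([0, 0, 1, 1], [0, 1, 0, 1])) # close to 0: unrelated partitions
```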


In [None]:
#Our Training/Validation Data
train_images = images[1000:]
train_labels = to_categorical(new_labels[1000:])
test_images = images[:1000]

# learning
true_labels = np.argmax(labels, axis = -1)
START = 1
END = 15
BATCH_SIZE = 20
variance = 0.98
NUM_CLUSTER = 8
NUM_EPOCH = 20
nmi_score = []
confusion_score=[]

for ckpt in range(START, END+1):
  previous_labels = new_labels
  previous_model = load_model('checkpoint/%d.ckpt'%(ckpt-1))
  net = Model(inputs=previous_model.input, outputs=previous_model.get_layer('avg_pool').output)
  
  # extract features for all images, in batches
  new_features = net.predict(images, batch_size=BATCH_SIZE)

  # clustering
  new_labels = clustering(new_features, NUM_CLUSTER, variance)
  train_labels = to_categorical(new_labels[1000:])
  # Visualizing the clusters:
  TSNE_plot(new_features,new_labels,ckpt) # to visualize predicted clusters 
  # TSNE_plot(new_features,true_labels,ckpt) # to visualize clusters according to their real labels
  nmi_score.append(normalized_mutual_info_score(new_labels,previous_labels))
  
  # retrain: fine tune
  init_model = load_model('checkpoint/0.ckpt')
  x = init_model.get_layer('avg_pool').output
  x = Flatten(name='flatten')(x)
  x = Dense(128, activation='relu', kernel_initializer=RandomNormal(mean=0.0, stddev=0.001))(x)
  x = Dropout(0.5)(x)
  x = Dense(NUM_CLUSTER, activation='softmax', name='fc8', kernel_initializer=RandomNormal(mean=0.0, stddev=0.001))(x)
  net = Model(inputs=init_model.input, outputs=x)
  
  for layer in net.layers:
    layer.trainable = True

  net.compile(loss='categorical_crossentropy',
              optimizer=SGD(learning_rate=0.001, momentum=0.9),
              metrics=['acc'])
  
  net.fit(train_images, train_labels, batch_size = BATCH_SIZE,epochs = NUM_EPOCH)
  
  # Confusion matrix
  init_dicct = dict_init(true_labels, new_labels, NUM_CLUSTER)
  cm=confusion_matrix(new_labels[:1000],map(0,1000,true_labels, init_dicct, NUM_CLUSTER))
  plt.figure()  # one figure per iteration, so heatmaps do not overlap
  sn.heatmap(data=cm,fmt='.0f',xticklabels=range(NUM_CLUSTER),yticklabels=range(NUM_CLUSTER),annot=True)
  plt.show()

  # Performance score
  confusion_score.append(accuracy_score(new_labels[3000:],map(3000,5000,true_labels,init_dicct, NUM_CLUSTER)))
  #print('Average precision-recall score: {0:0.2f}'.format(average_precision_score(new_labels[3000:],map(3000,5000,true_labels,init_dicct, NUM_CLUSTER)))
  
  #Save the model : 

  net.save('checkpoint/%d.ckpt'%ckpt)

# Plot nmi :
epochs = range(1, len(nmi_score) + 1)

plt.plot(epochs, nmi_score, 'b', label='nmi score')
plt.title('nmi score t/t-1')

plt.legend() 
plt.show()

plt.figure() 
plt.plot(epochs, confusion_score, 'b', label='clustering accuracy')
plt.title('Clustering accuracy score')
plt.legend()

plt.show() 


# Conclusion

In this project, we built a deep clustering model that classifies the images well, as shown by the results in the previous section.

The results were generally satisfactory. Our design could be improved by adding a similarity test to select reliable training samples and so avoid redundancy in our model.

Overall, the deep clustering task was successful, as the model was able to adapt and recognize clusters of different types of images. Deep learning and domain adaptation therefore appear to be key to advancing pathology AI, especially for pathology data that has not yet been processed.
