First, we must import the needed modules to load and work with the data

In [1]:
import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt
import pandas as pd
import matplotlib.colors as colors

Load data containing users confusion matrices, and prior information on each image. 

In [2]:
retired_images = loadmat('retired_images.mat')
conf_matrices = loadmat('conf_matrices.mat')
PP_matrices = loadmat('PP_matrices.mat')

Format data into a 'pandas' data structure to make it more functional in python

In [3]:
tmpPP  = []
tmpPP1 = []
for iN in range(PP_matrices['PP_matrices'][0].size):
    tmpPP.append(PP_matrices['PP_matrices'][0][iN]['imageID'][0][0])
    tmpPP1.append( PP_matrices['PP_matrices'][0][iN]['matrix'])


tmpCM  = []
tmpCM1 = []
for iN in range(conf_matrices['conf_matrices'].size):
    tmpCM.append(conf_matrices['conf_matrices'][iN]['userID'][0][0][0])
    tmpCM1.append(conf_matrices['conf_matrices'][iN]['conf_matrix'][0])


tmpRI  = []
for iN in range(retired_images['retired_images'].size):
    tmpRI.append(retired_images['retired_images'][0][iN]['imageID'][0][0])


conf_matrices  = pd.DataFrame({ 'userID' : tmpCM,'conf_matrix' : tmpCM1})
retired_images = pd.DataFrame({ 'imageID' : tmpRI})
PP_matrices    = pd.DataFrame({ 'imageID' : tmpPP,'pp_matrix' : tmpPP1})

Format image batches into 'pandas' data structure

In [4]:
for i in range(1,2):
    batch_name = 'batch' + str(i) + '.mat' #batch1.mat, batch2.mat, etc
    batch = loadmat(batch_name) #read batch file
    tmpType         = []
    tmpLabels       = []
    tmpuserIDs      = []
    tmpTruelabel    = []
    tmpImageID      = []
    tmpML_posterior = []
    # Subtracting 1 off the index from the mat file for the "labels" so that the indexing works in python. 
    for iN in range(batch['images'].size):
        tmpType.append(batch['images'][iN]['type'][0][0])
        tmpLabels.append(batch['images'][iN]['labels'][0][0]-1)
        tmpuserIDs.append(batch['images'][iN]['IDs'][0][0])
        tmpTruelabel.append(batch['images'][iN]['truelabel'][0][0][0]-1)
        tmpML_posterior.append(batch['images'][iN]['ML_posterior'][0][0])
        tmpImageID.append(batch['images'][iN]['imageID'][0][0][0])

    images = pd.DataFrame({'type' : tmpType,'labels' : tmpLabels, 'userIDs' : tmpuserIDs, 'ML_posterior' : tmpML_posterior, 'truelabel' : tmpTruelabel, 'imageID' : tmpImageID}) #store formatted data in 'images' structure



Lets talk boout the conf_matrices.mat file that stores the confusion matrices of all the users. This information (stored in a .datfile) will be a NX1 array where N is the number of users. in each row we have a cell array that contains the CXC "confusion matrix" for that user. A perfectly skilled user would only have values on the diagonal of this matrix and all off diagonal values indicate wrong answers were given to one category or another when presented with a 'G' true labelled image. To illustrate this we'll print out one users confusion matrix. 

In [5]:
print conf_matrices['conf_matrix'][0] #print confusion matrix of first user

[[191   3   5   2   5   3   3   6   3   1   2   1   1   5   1]
 [  4 184   1   4   5   4   1   3   5   1   5   4   1   4   5]
 [  3   3 217   4   2   2   4   2   1   1   3   2   4   3   1]
 [  4   2   2 213   2   2   4   5   2   1   1   3   4   2   4]
 [  4   1   2   3 210   4   3   1   1   5   4   5   3   4   2]
 [  5   1   3   1   2 220   1   5   5   1   5   5   2   4   5]
 [  5   4   4   3   3   3 181   1   3   1   1   3   3   5   3]
 [  2   4   5   2   5   5   2 198   4   3   4   4   2   5   1]
 [  3   1   4   3   4   1   1   5 210   5   5   2   5   4   1]
 [  4   4   2   4   5   1   1   4   3 188   1   2   1   2   4]
 [  2   3   5   2   3   5   1   1   1   1 194   5   2   3   4]
 [  2   1   5   2   3   2   1   2   4   1   4 215   1   5   4]
 [  1   5   5   5   5   1   1   2   5   4   1   1 213   3   2]
 [  3   4   5 214   4   2 183 199   1   6   6 216 214   4   5]
 [  4   4   5   4   3   5   1   3   4   2   5   5   2   3 192]]


For example, the value in the first row, second column is a 3. This indicates that the user classified 3 images in class 2, when really they should be in class 1. Confusion matrices are used to evaluate users skill level, and are updated as images are retired. 

Here's a good visualization of the matrix in heat map form:

In [36]:
data = conf_matrices['conf_matrix'][0]
plt.matshow(data, cmap='Blues', norm=colors.LogNorm(vmin=data.min(), vmax=data.max()))
plt.colorbar()
plt.xlabel('classes')
plt.ylabel('classes')
plt.title('Visualization of confusion matrix')
ax = plt.gca()
ax.set_xticks(np.arange(0,15,1))
ax.set_yticks(np.arange(0,15,1))
plt.show()



Looking down the diagonal we can see that this user was not the best at classifying images belonging to class 13. If the user were perfect, he or she would have a dark blue diagonal, and whitespace everywhere else. 

Now, we will define the main function that evaluates the batches of images labeled by users. The Zooniverse server (specifically Nero https://github.com/zooniverse/nero) will continuously send data (image batches) to be evaluated. 

Lets look at some of that data for one image:

In [37]:
print images.iloc[0,:] #accesses all of the first images information


ML_posterior    [0.0143179197288, 0.0143179197288, 0.014317919...
imageID                                                        30
labels          [13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 1...
truelabel                                                      -2
type                                                            T
userIDs         [18, 29, 24, 22, 30, 14, 13, 16, 1, 9, 7, 27, ...
Name: 0, dtype: object


Okay, thats a lot of information for one image. Lets breakdown what this data actually stores, and how to access certain parts.

The ID of the users, the ID of the image they classified, and 
the classification made by that user for that image are all held in the 'images' structure. Lets access them.

In [39]:
Sample_ImageID = images.iloc[0,:]['imageID']
Sample_UserIDs = images.iloc[0,:]['userIDs']
Sample_labels = images.iloc[0,:]['labels']

print Sample_ImageID
print images['imageID'][0], "\n" 

print Sample_UserIDs
print images['userIDs'][0], "\n"

print Sample_labels
print images['labels'][0], "\n"

#Two different ways to access the same image data

30
30 

[18 29 24 22 30 14 13 16  1  9  7 27 23  2 21 11 12 15 28  5 25  8 20]
[18 29 24 22 30 14 13 16  1  9  7 27 23  2 21 11 12 15 28  5 25  8 20] 

[13 13 13 13 13 13 13 13 13 13 13 13 13 13  2 12  5 12 10  4 10  4 11]
[13 13 13 13 13 13 13 13 13 13 13 13 13 13  2 12  5 12 10  4 10  4 11] 



The first number is the Image ID corresponding to a specific image.

The second array holds the user ID's that have classified the image.

The third array holds the classifications made by the users corresponding (one to one) to the second array.

The 'images' structure also contains an images Type, Machine Learning (ML) Posterior, and a True label for each image:

THE TYPE - A label (string) either 'T' or 'G' to determine if it is a ML classified label ('T') or a pre-labelled "golden" image ('G')

ML POSTERIOR- An array (double) of a  1XC row vector where C is the number of pre-determined morphologies that the classifier has been trained on. Each column is the ML confidence (percentage) that the image belongs in one of the C classes.

TRUE LABEL - (int) For images labelled 'T' this values is set to -1 but for images labelled 'G' This value indicates the "true" class that this image belongs in for the purposes of comparing a citizens classification with this true label.

Lets access these elements of 'images'

In [40]:
Sample_type = images.iloc[0,:]['type']
Sample_ML_Posterior = images.iloc[0,:]['ML_posterior']
Sample_TrueLabel = images.iloc[0,:]['truelabel']

print Sample_type
print images['type'][0], "\n"

print Sample_ML_Posterior
print images['ML_posterior'][0], "\n"

print Sample_TrueLabel
print images['truelabel'][0]

#Two different ways to access the same image data

T
T 

[ 0.01431792  0.01431792  0.01431792  0.01431792  0.01431792  0.01431792
  0.01431792  0.01431792  0.01431792  0.01431792  0.01431792  0.01431792
  0.01431792  0.79954912  0.01431792]
[ 0.01431792  0.01431792  0.01431792  0.01431792  0.01431792  0.01431792
  0.01431792  0.01431792  0.01431792  0.01431792  0.01431792  0.01431792
  0.01431792  0.79954912  0.01431792] 

-2
-2


What does all this information mean?

This image has type 'T', meaning it is NOT a golden image and is still in testing.

From the ML_Posterior, we can see that the Machine is 79.95% sure this image is in the 14th class, and  1.43% sure it is in each other class.

A true label of -2 means this image is still testing and has not been assigned a class.

Now that we understand the structure of 'images' and how to access specific image information, lets move on to evaluating image and user classifications.

Initialize R_lim, the limit on how many people can look at an image before it is passed onto a higher skill level

In [43]:
R_lim = 23

Initialize N, the number of images in a batch

In [44]:
N = images['type'].size

initialize C, the number of morphologies (classes)

In [45]:
 for i in range(N):
        if images['type'][i] == 'T':
            C = images['ML_posterior'][i].size
            break

Initialize a flat prior. Essentially this means before any more information is known, we assume each image has an equal probability of being in each of the 15 classes

In [46]:
priors = np.ones((1,C))

Initialize varaible t. 

t is a CX1 column vector where C is the number of pre-determined morphologies and where each row is the predetermined certainty threshold that an image must surpass to be considered part of class C. Here all classes have the same threshold but in realty different categories will have more difficult or more relaxed thresholds for determination of class and, as a result, 
retirability.

In [47]:
t = .4*np.ones((C,1))

initialize a decision matrix that holds the decision for each image

In [48]:
 dec_matrix = np.zeros((1,N))

initialize a class matrix that holds the True labels of each image. (corresponds one to one to decision matrix)

In [49]:
class_matrix = np.zeros((1,N))

initialize a list to hold the pp_matrices for each images. We'll talk about the importance of pp_matrices later. Note that pp_matrix and posterior matrix may be used interchangeably. 

In [50]:
pp_matrices_rack = []

Now lets look at how a decision for an image is made. The decider function takes in an images posterior matrix, Machine learning decision, number of annotators, and R_lim as arguments, and uses that information to decide the next step for the image.

In [51]:
def decider(pp_matrix, ML_dec, t, R_lim, num_annotators): #define the decider function with given arguments
    pp_matrix2 = np.hstack((pp_matrix, ML_dec.reshape((15,1)))) #Include ML_decision in posterior matrix
    v = np.sum(pp_matrix2, axis=1)/np.sum(np.sum(pp_matrix)) #create vector of normalized sums of pp_matrix2 
    maximum = np.amax(v) #initialize maximum, max value of v
    maxIdx = np.argmax(v) #initialize maxIdx, index of max value of v

    if maximum >= t[maxIdx]: #if maximum is above threshold for that specific class

        decision = 1 #retire the image
        print('Image is retired')

    elif num_annotators >= R_lim: #if more than R_lim annotators have looked at image and no decision reached

        decision = 2 #pass image on to next user skill class
        print('Image is given to the upper class')

    else: #if fewer than R_lim annotators have looked at image

        decision = 3 #keep image in same class
        print('More labels are needed for the image')

    image_class = maxIdx #set image_class 

    return decision, image_class #return the decision, and image class

The next chunk of code is used to update users confusion matrices, images posterior matrices (PP_matrix), and to make decisions on the future of an image.

In [None]:
for i in range(N):

        if images['type'][i] == 'G': #check if golden set image
            labels  = images['labels'][i] #take citizen labels of image
            userIDs = images['userIDs'][i] #take IDs of citizens who label image
            tlabel  = images['truelabel'][i] #take true label of image

            for ii in range(userIDs.size): #iterate over user IDs of image

                indicator = 0 #initialize indicator to zero 

                for cc in range(len(conf_matrices)): #iterate over confusion matrices

                    if userIDs[ii] == conf_matrices['userID'][cc]: #if user is already registered

                        conf_matrix = conf_matrices['conf_matrix'][cc] #take confusion matrix of citizen
                        conf_matrix[tlabel,labels[ii]] += 1 #update confusion matrix
                        conf_matrices['conf_matrix'][cc] = conf_matrix #confusion matrix put back in stack
                        indicator = 1

                if indicator == 0: #if user not registered

                    dummy_matrix = np.zeros((C,C)) #create dummy matrix
                    dummy_matrix[tlabel,labels[ii]] += 1 #update dummy matrix
                    tmp = pd.DataFrame({ 'userID' : [userIDs[ii]],'conf_matrix' : [dummy_matrix]},index = [len(conf_matrices)]) #create new users confusion matrix
                    conf_matrices = conf_matrices.append(tmp) #append new users confusion matrix to stack

            dec_matrix[0,i] = 0 #since it is a training image, no decision is made
            class_matrix[0,i] = tlabel #class of image is its true label
            print('The image is from the training set')

        else: #if image not in golden set, i.e. has ML label but no true label

            indicator1 = 0

            for kk in range(retired_images.size): #loop over retired images

                if images['imageID'][i] == retired_images['imageID'][kk]: #if image is retired
                    indicator1 = 1
                    dec_matrix[0,i] = -1 #give invalid decision
                    break

            if indicator1 == 0: #if image is not retired

                labels           = images['labels'][i] #take citizen labels of image
                userIDs          = images['userIDs'][i] #take IDs of citizens who label image
                num_annotators   = labels.size #define number of citizens who annotate image
                ML_dec           = images['ML_posterior'][i] #take ML posteriors of image
                imageID          = images['imageID'][i] #take ID of image
                image_prior      = priors #set priors for image to original priors

                for y in range(len(PP_matrices)): #iterate over posterior matrices

                    if imageID == PP_matrices['imageID'][y]: #find posterior matrix for the image
                        image_prior = np.sum(PP_matrices['pp_matrix'][y],axis=1)/np.sum(PP_matrices['pp_matrix'][y]) #if image labeled but not retired, PP_matrix information is used in the place of priors
                        break

                for k in range(num_annotators): #iterate over citizens that labeled image
                    for iN in range(len(conf_matrices)): #iterate over confusion matrices

                        if userIDs[k] == conf_matrices['userID'][iN]: #find confusion matrix corresponding to citizen
                            
                            conf = conf_matrices['conf_matrix'][iN] #take confusion matrix of citizen
                            break

                    conf_divided,x,z,s = np.linalg.lstsq(np.diag(sum(conf,2)),conf) #calculate p(l|j) value

                    for j in range(C): #iterate over classes

                        pp_matrix = np.zeros((C,num_annotators)) #create posterior matrix
                        pp_matrix[j,k] = (conf_divided[j,labels[k]]*priors[0][j])/sum(conf_divided[:,labels[k]]*priors[0]) #calculate posteriors
                pp_matrices_rack.append(pp_matrix) #assign values to pp_matrices_rack


                dec_matrix[0,i], class_matrix[0,i] = decider(pp_matrix, ML_dec, t, R_lim, num_annotators) #make decisions for each image in batch


Image is retired


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


The image is from the training set
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
The image is from the training set
The image is from the training set
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired
Image is retired


Now lets take a look at how the decider function works:

In [34]:
#Sample arguments for decider function 
pp_matrix_sample = PP_matrices['pp_matrix'][0]
ML_dec_sample = np.array([.1, .05, .05, .05, .05, .6, .001, .014, .1, .002, .005, .0002, 0, 0, 0])
R_lim_sample = 20
no_annotators_sample = len(PP_matrices['pp_matrix'][0][0])



decider(pp_matrix_sample, ML_dec_sample, t, R_lim_sample, no_annotators_sample) #call decider function with sample arguments



Image is given to the upper class


NameError: name 'pp_matrix2' is not defined