First, we import the modules needed to load and work with the data.

In [117]:
import numpy as np
from scipy.io import loadmat
import matplotlib.pyplot as plt
from matplotlib import colors



Load the data: user information (including each user's confusion matrix) and prior information on each image.

In [118]:
true_labels = loadmat('true_labels.mat')
retired_images = loadmat('retired_images.mat')
conf_matrices = loadmat('conf_matrices.mat')
PP_matrices = loadmat('PP_matrices.mat')

Let's talk about the conf_matrices.mat file, which stores the confusion matrices of all the users. This information (stored in a .mat file) is an N x 1 array, where N is the number of users. Each row holds a cell array that contains the C x C "confusion matrix" for that user, where C is the number of classes. Rows correspond to the true class of a golden ('G') image and columns to the class the user chose. A perfectly skilled user would only have values on the diagonal of this matrix; every off-diagonal value counts wrong answers, where the user was shown a golden image of one class and put it in another. To illustrate this, we'll print out one user's confusion matrix.

In [119]:
print(conf_matrices['conf_matrices'][0][0][0])

[[191   3   5   2   5   3   3   6   3   1   2   1   1   5   1]
 [  4 184   1   4   5   4   1   3   5   1   5   4   1   4   5]
 [  3   3 217   4   2   2   4   2   1   1   3   2   4   3   1]
 [  4   2   2 213   2   2   4   5   2   1   1   3   4   2   4]
 [  4   1   2   3 210   4   3   1   1   5   4   5   3   4   2]
 [  5   1   3   1   2 220   1   5   5   1   5   5   2   4   5]
 [  5   4   4   3   3   3 181   1   3   1   1   3   3   5   3]
 [  2   4   5   2   5   5   2 198   4   3   4   4   2   5   1]
 [  3   1   4   3   4   1   1   5 210   5   5   2   5   4   1]
 [  4   4   2   4   5   1   1   4   3 188   1   2   1   2   4]
 [  2   3   5   2   3   5   1   1   1   1 194   5   2   3   4]
 [  2   1   5   2   3   2   1   2   4   1   4 215   1   5   4]
 [  1   5   5   5   5   1   1   2   5   4   1   1 213   3   2]
 [  3   4   5 214   4   2 183 199   1   6   6 216 214   4   5]
 [  4   4   5   4   3   5   1   3   4   2   5   5   2   3 192]]


For example, the value in the first row, second column is a 3. This indicates that the user put 3 images in class 2 when they really belong in class 1. Confusion matrices are used to evaluate each user's skill level, and are updated as images are retired.
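To make the row/column convention concrete, here is a minimal sketch on a made-up 3-class confusion matrix. The numbers and the names `toy_conf`, `per_class_accuracy` and `overall_accuracy` are illustrative and do not come from the data files. It assumes, as described above, that rows are true classes and columns are the classes the user chose.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = user's answer.
toy_conf = np.array([[8, 1, 1],
                     [2, 6, 2],
                     [0, 1, 9]], dtype=float)

# Entry [0, 1]: images of true class 1 that the user put in class 2.
misfiled_1_as_2 = toy_conf[0, 1]

# Fraction of each true class labelled correctly (diagonal over row sums).
per_class_accuracy = np.diag(toy_conf) / toy_conf.sum(axis=1)

# Overall fraction of correct answers (sum of the diagonal over all answers).
overall_accuracy = np.trace(toy_conf) / toy_conf.sum()

print(misfiled_1_as_2)
print(per_class_accuracy)
print(overall_accuracy)
```

A user whose matrix has most of its weight on the diagonal, like this one, gets per-class accuracies close to 1.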

Here's a good visualization of the matrix in heat map form:

In [120]:
data = conf_matrices['conf_matrices'][0][0][0]
plt.matshow(data, cmap='Blues', norm=colors.LogNorm(vmin=data.min(), vmax=data.max()))
plt.colorbar()
plt.xlabel('class chosen by user')
plt.ylabel('true class')
plt.title('Visualization of confusion matrix')
ax = plt.gca()
ax.set_xticks(np.arange(0,15,1))
ax.set_yticks(np.arange(0,15,1))
plt.show()



Now, we will define the main function that evaluates the batches of images labeled by users. The Zooniverse server (specifically Nero https://github.com/zooniverse/nero) will continuously send data (batches) to be evaluated. 

Let's load and print some of that data using the loadmat function we imported.

In [121]:
batch = loadmat('batch1.mat')
print(batch['images'][0][0])

([u'T'], [[14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 3, 13, 6, 13, 11, 5, 11, 5, 12]], [[18, 29, 24, 22, 30, 14, 13, 16, 1, 9, 7, 27, 23, 2, 21, 11, 12, 15, 28, 5, 25, 8, 20]], [[0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.014317919728839477, 0.7995491237962473, 0.014317919728839477]], [[-1]], [[30]])


Okay, that's a lot of information for one image. Let's break down what these data 'batches' actually store, and how to access certain parts.

The IDs of the users, the ID of the image they classified, and the classification made by each user for that image are all held in the batches. Let's access them.

In [122]:
Sample_ImageID = batch['images'][0]['imageID'][0][0][0]
Sample_UserIDs = batch['images'][0]['IDs'][0][0]
Sample_classifications = batch['images'][0]['labels'][0][0]

print(Sample_ImageID)
print(Sample_UserIDs)
print(Sample_classifications)


30
[18 29 24 22 30 14 13 16  1  9  7 27 23  2 21 11 12 15 28  5 25  8 20]
[14 14 14 14 14 14 14 14 14 14 14 14 14 14  3 13  6 13 11  5 11  5 12]


The first number is the image ID of a specific image.

The second array holds the IDs of the users that classified the image.

The third array holds the classifications of the image made by those users, in the same order as the second array (one to one).
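As a quick sanity check on these arrays, the sketch below tallies how many users chose each class for the sample image. This is only a plain vote count, not the evaluation the notebook builds later; `sample_labels` is copied from the printed output, and `n_classes`, `votes` and `most_voted_class` are names introduced here for illustration.

```python
import numpy as np

# Classifications copied from the sample output above (one entry per user).
sample_labels = np.array([14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
                          3, 13, 6, 13, 11, 5, 11, 5, 12])

# Classes are numbered from 1, so index 0 of the tally stays unused.
n_classes = 15
votes = np.bincount(sample_labels, minlength=n_classes + 1)

# Class with the most votes and the share of users who chose it.
most_voted_class = int(np.argmax(votes))
vote_share = votes[most_voted_class] / float(len(sample_labels))

print(most_voted_class)
print(votes[most_voted_class], len(sample_labels))
```

Here 14 of the 23 users picked class 14, and the rest are spread over six other classes.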

Batches also contain an image's type, machine learning (ML) posterior, and true label:

Type - A label (string), either 'T' or 'G', indicating whether the image only has an ML classification ('T') or is a pre-labelled "golden" image ('G').

ML Posterior - A 1 x C row vector (double), where C is the number of pre-determined morphologies that the classifier has been trained on. Each column is the ML confidence (a probability between 0 and 1) that the image belongs to one of the C classes.

True Label - (int) For images of type 'T' this value is set to -1. For images of type 'G' it gives the "true" class of the image, which a citizen's classification is compared against.

Let's access these elements of the batches.

In [123]:
Sample_type = batch['images'][0]['type'][0][0]
Sample_ML_Posterior = batch['images'][0]['ML_posterior'][0][0]
Sample_TrueLabel = batch['images'][0]['truelabel'][0][0][0]

print(Sample_type)
print(Sample_ML_Posterior)
print(Sample_TrueLabel)

T
[ 0.01431792  0.01431792  0.01431792  0.01431792  0.01431792  0.01431792
  0.01431792  0.01431792  0.01431792  0.01431792  0.01431792  0.01431792
  0.01431792  0.79954912  0.01431792]
-1


What does all this information mean?

This image has type 'T' meaning it is not a golden image and is still in testing.

From the ML posterior, we can see that the classifier is 79.95% sure this image is in the 14th class, and 1.43% sure it is in each of the other classes.

A true label of -1 means this image is still testing and has not been assigned a class.
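The reading of the ML posterior above can be reproduced with a few lines. This is a small sketch using the values from the printed sample; `ml_posterior`, `ml_class` and `ml_confidence` are names chosen here, and the `+ 1` assumes classes are numbered from 1 while array indices start at 0.

```python
import numpy as np

# ML posterior copied from the sample output (15 classes).
ml_posterior = np.full(15, 0.014317919728839477)
ml_posterior[13] = 0.7995491237962473

# argmax is 0-based; add 1 to get the 1-based class number.
ml_class = int(np.argmax(ml_posterior)) + 1
ml_confidence = ml_posterior.max()

# The entries of a posterior should add up to 1.
sums_to_one = bool(np.isclose(ml_posterior.sum(), 1.0))

print(ml_class)
print(round(100 * ml_confidence, 2))
print(sums_to_one)
```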

Now that we understand the structure of each batch and how to access specific information within the batches, let's move on to evaluating image and user classifications.

To start, let's set the prior for each image from how often each class appears in true_labels. If every class is equally represented there, this is a flat prior: before more information is analyzed, the image has an equal chance of being in each of the 15 classes.

In [124]:
labels_all = true_labels['true_labels'][0]
_, no_labels = np.unique(labels_all, return_counts=True) #count how many times each class appears
priors = no_labels/float(len(labels_all)) #fraction of labels that fall in each class
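As a minimal sketch of what a prior over classes looks like, the snippet below uses made-up labels for a 3-class problem (`toy_true_labels` is hypothetical, not the contents of true_labels.mat). It computes an empirical prior from label frequencies and compares it with a flat prior.

```python
import numpy as np

# Hypothetical true labels for a 3-class problem (1-based class numbers).
toy_true_labels = np.array([1, 1, 1, 1, 2, 3, 3, 3])

# Empirical prior: fraction of labels that fall in each class.
classes, counts = np.unique(toy_true_labels, return_counts=True)
empirical_prior = counts / float(len(toy_true_labels))

# Flat prior: every class equally likely, regardless of the labels.
flat_prior = np.ones(len(classes)) / len(classes)

print(empirical_prior)
print(flat_prior)
```

The two only coincide when every class appears equally often in the labels; with the unbalanced toy labels here they differ.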


Initialize R_lim, the limit on how many people can look at an image before it is passed on to a higher skill level.

In [125]:
R_lim = 23

Initialize N, the number of images in a batch.

In [126]:
N = len(batch['images'])

Initialize C, the number of morphologies (classes).

In [127]:
for i in range(N):
    if batch['images'][i][0][0] == 'T':
        C = len(batch['images'][i]['ML_posterior'][0][0])


Initialize the variable t. t is a C x 1 column vector, where C is the number of pre-determined morphologies and each row is the certainty threshold that an image must surpass to be considered part of that class. Here all classes have the same threshold, but in reality different categories will have stricter or more relaxed thresholds for determining class and, as a result, retirability.

In [128]:
t = .4*np.ones((C,1))
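To illustrate how class-specific thresholds would change the outcome, here is a hedged sketch. The stricter and more relaxed values (0.6 and 0.3), the choice of classes 3 and 7, and the posterior vector `v` are all invented for the example.

```python
import numpy as np

n_classes = 15

# Start from the uniform threshold used in this notebook.
thresholds = 0.4 * np.ones((n_classes, 1))

# Hypothetical adjustments: class 3 is made stricter, class 7 more relaxed.
thresholds[3 - 1] = 0.6
thresholds[7 - 1] = 0.3

# Made-up normalized posterior vector that peaks at class 3 with value 0.5.
v = np.full(n_classes, 0.5 / (n_classes - 1))
v[3 - 1] = 0.5

best = int(np.argmax(v))                       # 0-based index of the likeliest class
passes = bool(v[best] >= thresholds[best, 0])  # against the class-specific threshold
passes_uniform = bool(v[best] >= 0.4)          # against the uniform threshold

print(best + 1, passes, passes_uniform)
```

With the uniform 0.4 threshold this image would clear the bar, but under the stricter threshold for class 3 it would not.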

Initialize a decision matrix that holds the decision for each image.

In [129]:
dec_matrix = np.zeros((1,N))

Initialize a class matrix that holds the true label of each image (it corresponds one to one with the decision matrix).

In [130]:
class_matrix = np.zeros((1,N))

Initialize a list to hold the pp_matrices for each image. We'll talk about the importance of pp_matrices later.

In [131]:
pp_matrices_rack = []

Now let's look at how a decision for an image is made. The decider function takes in an image's posterior matrix, the machine learning posterior, the thresholds t, R_lim, and the number of annotators as arguments, and uses that information to decide the next step for the image.

In [135]:
def decider(pp_matrix, ML_dec, t, R_lim, no_annotators): #define the decider function with given arguments
    pp_matrix2 = np.append(pp_matrix, ML_dec.reshape((-1,1)), axis=1) #Include ML_decision as an extra column of the posterior matrix
    v = np.sum(pp_matrix2, axis=1)/np.sum(pp_matrix2) #create vector of normalized sums of pp_matrix2 
    maximum = np.amax(v) #initialize maximum, max value of v
    maxIdx = np.argmax(v) #initialize maxIdx, index of max value of v

    if maximum >= t[maxIdx]: #if maximum is above threshold for that specific class

        decision = 1 #retire the image
        print('Image is retired')

    elif no_annotators >= R_lim: #if at least R_lim annotators have looked at image and no decision reached

        decision = 2 #pass image on to next user skill class
        print('Image is given to the upper class')

    else: #if fewer than R_lim annotators have looked at image

        decision = 3 #keep image in same class
        print('More labels are needed for the image')

    image_class = maxIdx #set image_class (0-based index of the most likely class)

    return decision, image_class #return the decision, and image class
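To see the three possible outcomes side by side, here is a self-contained toy version of the decision rule on a 3-class problem. It is a sketch under stated assumptions rather than the notebook's function: `decide_sketch` and all the numbers are made up, the ML posterior is treated as one extra column, and the summed evidence is normalized by the total of the combined matrix.

```python
import numpy as np

def decide_sketch(pp_matrix, ml_posterior, thresholds, r_lim, n_annotators):
    """Toy decision rule: 1 = retire, 2 = pass to upper class, 3 = need more labels."""
    # Treat the ML posterior as one extra column next to the annotators' columns.
    combined = np.hstack([pp_matrix, ml_posterior.reshape(-1, 1)])
    # Sum the evidence per class and normalize so the entries add up to 1.
    v = combined.sum(axis=1) / combined.sum()
    best = int(np.argmax(v))  # 0-based index of the most likely class
    if v[best] >= thresholds[best]:
        return 1, best
    if n_annotators >= r_lim:
        return 2, best
    return 3, best

thresholds = 0.4 * np.ones(3)

# Two annotators and the ML posterior all favour class index 0.
confident = np.array([[0.8, 0.7], [0.1, 0.2], [0.1, 0.1]])
ml_a = np.array([0.9, 0.05, 0.05])
retire = decide_sketch(confident, ml_a, thresholds, r_lim=5, n_annotators=2)

# Evidence spread evenly: no class clears the threshold.
unsure = np.full((3, 2), 1.0 / 3)
ml_b = np.full(3, 1.0 / 3)
keep = decide_sketch(unsure, ml_b, thresholds, r_lim=5, n_annotators=2)
pass_up = decide_sketch(unsure, ml_b, thresholds, r_lim=2, n_annotators=2)

print(retire, keep, pass_up)
```

The only difference between the last two calls is the annotator limit: once the number of annotators reaches `r_lim` without a confident answer, the image moves up instead of waiting for more labels.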

The next chunk of code is used to update users' confusion matrices and images' posterior matrices (PP_matrix), and to make decisions on the fate of each image.
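The posterior update for a single user's answer is Bayes' rule, which can be sketched in isolation. This example uses a made-up 3-class confusion matrix and a flat prior, and it assumes rows of the confusion matrix are true classes and columns are the user's answers; `conf`, `prior`, `likelihood` and `answer` are illustrative names.

```python
import numpy as np

# Hypothetical confusion matrix of one user: rows = true class, columns = answer.
conf = np.array([[8.0, 1.0, 1.0],
                 [2.0, 6.0, 2.0],
                 [0.0, 1.0, 9.0]])
prior = np.ones(3) / 3  # flat prior over the 3 classes

# p(l | j): probability the user answers l when the true class is j.
# Dividing each row by its sum turns counts into these probabilities.
likelihood = conf / conf.sum(axis=1, keepdims=True)

# The user answered class 2 (1-based), i.e. column index 1.
answer = 2
unnormalized = likelihood[:, answer - 1] * prior

# p(j | l): posterior probability of each true class given that answer.
posterior = unnormalized / unnormalized.sum()

print(posterior)
```

Because this user is usually right about class 2, their answer moves most of the probability (0.75) onto class 2 and leaves 0.125 on each of the other classes.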

In [136]:
for i in range(N): #For each image in the batch

    if batch['images'][i]['type'][0][0] == 'G': #check if image is in golden set
        labels = batch['images'][i]['labels'][0][0] #take citizen labels of image
        userIDs = batch['images'][i]['IDs'][0][0] #take IDs of citizens who labeled the image
        tlabel = batch['images'][i]['truelabel'][0][0][0] #take true label of image

        for ii in range(len(userIDs)): #For each user 

            indicator = 0 #initialize indicator to zero for each user

            for cc in range(len(conf_matrices['conf_matrices'])): #for each user's confusion matrix

                if userIDs[ii] == conf_matrices['conf_matrices'][cc]['userID'][0][0][0]: #if user already has a confusion matrix

                    conf_matrix = conf_matrices['conf_matrices'][cc]['conf_matrix'][0] #take confusion matrix of citizen
                    conf_matrix[tlabel-1,labels[ii]-1] += 1 #update confusion matrix: row = true class, column = class the user chose
                    conf_matrices['conf_matrices'][cc]['conf_matrix'][0] = conf_matrix #confusion matrix put back in stack
                    indicator = 1 #user is already registered

            if indicator == 0: #if user not registered

                dummy_matrix = np.zeros((C,C)) #create dummy confusion matrix
                dummy_matrix[tlabel-1,labels[ii]-1] += 1 #record this classification in the dummy matrix
                #the new user's matrix and userIDs[ii] still have to be appended to conf_matrices (not implemented here)

        dec_matrix[0,i] = 0 #Since image is golden, no decision needs to be made about the image
        class_matrix[0,i] = tlabel #class of image is its true label
        print('The image is from the training set')

    else: #if image not in golden set, i.e. has ML label but no true label

        indicator1 = 0 #initialize indicator1 to zero

        for kk in range(len(retired_images['retired_images'][0])): #for each retired image

            if batch['images'][i]['imageID'][0][0][0] == retired_images['retired_images'][0][kk]['imageID'][0][0]: #if image is already retired

                indicator1 = 1 #set indicator1 to one, meaning image has already been retired
                dec_matrix[0,i] = -1 #give invalid decision

        if indicator1 == 0: #if image has not already been retired

            labels = batch['images'][i]['labels'][0][0] #take citizen labels of image
            userIDs = batch['images'][i]['IDs'][0][0] #take IDs of citizens who label image
            no_annotators = len(labels) #take number of citizens who labeled the image
            ML_dec = batch['images'][i]['ML_posterior'][0][0] #take ML posteriors of image
            imageID = batch['images'][i]['imageID'][0][0][0] #take ID of image
            image_prior = priors #set priors for image to original priors

            for y in range(len(PP_matrices['PP_matrices'][0])): #for each image's posterior matrix

                if imageID == PP_matrices['PP_matrices'][0][y]['imageID'][0][0]: #If image already has a posterior matrix

                    image_prior = np.sum(PP_matrices['PP_matrices'][0][y]['matrix'],axis=1)/np.sum(PP_matrices['PP_matrices'][0][y]['matrix']) #Use posterior matrix information to update an image's prior

            pp_matrix = np.zeros((C,no_annotators)) #create posterior matrix, one column per citizen

            for k in range(no_annotators): #for each citizen that labeled image

                for iN in range(len(conf_matrices['conf_matrices'])): #for each confusion matrix

                    if userIDs[k] == conf_matrices['conf_matrices'][iN]['userID'][0][0][0]: #If citizen already has a confusion matrix

                        conf = conf_matrices['conf_matrices'][iN]['conf_matrix'][0].astype(float) #take confusion matrix of citizen
                        conf_divided = conf/np.sum(conf,axis=1,keepdims=True) #calculate p(l|j) by dividing each row by its sum
                        likelihood = conf_divided[:,labels[k]-1] #p(l|j) for the label this citizen gave, for every class j
                        pp_matrix[:,k] = (likelihood*image_prior)/np.sum(likelihood*image_prior) #fill this citizen's column of the posterior matrix

            pp_matrices_rack.append(pp_matrix) #put PP_matrix in the rack
            dec_matrix[0,i], class_matrix[0,i] = decider(pp_matrix, ML_dec, t, R_lim, no_annotators) #make decision for the image using decider function

Now we'll put the decider function in action:

In [137]:
def decider(pp_matrix, ML_dec, t, R_lim, no_annotators): 
    print("Here is the Posterior matrix %s" % (pp_matrix,))
    pp_matrix2 = np.append(pp_matrix, ML_dec.reshape((-1,1)), axis=1) 
    print("Here is the Posterior matrix with the ML Posterior %s" % (pp_matrix2,))
    v = np.sum(pp_matrix2, axis=1)/np.sum(pp_matrix2) 
    print("Here is our normalized posterior vector %s" % (v,))
    maxIdx = np.argmax(v) 
    print("Here is the most likely class for the image %d" % maxIdx)
    maximum = np.amax(v) 
    print("Here is the likelihood the image is in class %d: %s" % (maxIdx, maximum))
    if maximum >= t[maxIdx]: 

        decision = 1
        print('Image is retired')

    elif no_annotators >= R_lim: 

        decision = 2 
        print('Image is given to the upper class')

    else: 
        decision = 3
        print('More labels are needed for the image')

    image_class = maxIdx 

    return decision, image_class
pp_matrix_sample = PP_matrices['PP_matrices'][0][0][1]
ML_dec_sample = np.array([.1, .05, .05, .05, .05, .6, .001, .014, .1, .002, .005, .0002, 0, 0, 0]) #one entry per class (15 classes)
R_lim_sample = 20
no_annotators_sample = len(PP_matrices['PP_matrices'][0][0][1][0])
decider(pp_matrix_sample, ML_dec_sample, t, R_lim_sample, no_annotators_sample)

Here is the Posterior matrix [[ 0.01744259  0.01731697  0.02493144  0.01028312  0.00546532  0.01070963
   0.02262277  0.01929081  0.02527572  0.00479306  0.01134239  0.01906742
   0.02150386  0.02335514  0.00585403  0.01772854  0.00278222  0.00497152
   0.00882818  0.02052645  0.01243257  0.00262403  0.01197271  0.01344635
   0.0114498   0.01040049  0.01084297]
 [ 0.02919905  0.00531959  0.02115639  0.02461174  0.02292532  0.0295438
   0.00979726  0.00594736  0.01623594  0.02340578  0.01826028  0.00613758
   0.01612789  0.02395399  0.01592205  0.01772854  0.00834666  0.0024543
   0.00236603  0.74853645  0.00796131  0.01312015  0.00420618  0.01140903
   0.0039356   0.7772285   0.00782651]
 [ 0.02305456  0.01614877  0.01209636  0.01459731  0.00514     0.00590876
   0.01982873  0.00699853  0.00549823  0.01589943  0.01833932  0.00539644
   0.01775423  0.00528796  0.00598241  0.73013397  0.00540801  0.00490859
   0.00455822  0.01115017  0.00255082  0.00237706  0.83404803  0.00288504
   0.01

In [30]:
print(len(PP_matrices['PP_matrices'][0][0][1][0]))

27
