# MIA Annotation Analysis

In this notebook, we will analyze the outputs of the annotation experiments. 

The annotations are recorded in JSON format. Four experiments are being executed and the data from each experiment is stored in a different folder under the main folder DATA. 

1) Image Sequences | Segmented Images

2) Image Sequences | Unsegmented Images

3) Individual Image | Segmented Images

4) Individual Image | Unsegmented Images

In [2]:
# This function will give a list of files in a folder. 
def getListOfFiles(dirName):
    # create a list of file and sub directories 
    # names in the given directory 
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory 
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
                
    return allFiles

In [3]:
import os

# The directory path to the images folder. This folder contains ALL images (i.e. segmented, unsegmented, positive, negative.)
# basedir = 'C:\\Users\\Deniz\\Dropbox\\Work\\xampp\\htdocs\\mia\\poi_seq\\input\\'
basedir = 'C:\\Users\\P70065719\\Desktop\\Thesis\\CROWD_TEST'
l_files = getListOfFiles(basedir)

png_files = []
non_repeat = []
for i in l_files:
    # just the PNG files
    if '.png' in i: 
        png_files.append(i)
        path, filename = os.path.split(i)
        #print(path)
        # we remove the duplicate names because these names refer to 5 images in a sequence, not just one image. 
        if path not in non_repeat: 
            non_repeat.append(path)

#for y in non_repeat: 
    #print(y)
print(len(non_repeat), ' PNG images were found in the folder.')

81  PNG images were found in the folder.


In [4]:
for i in range(10):
    print(non_repeat[i])

C:\Users\P70065719\Desktop\Thesis\CROWD_TEST\Negative_samples\seg_op\subset0\1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369979066736354549484\65
C:\Users\P70065719\Desktop\Thesis\CROWD_TEST\Negative_samples\seg_op\subset0\1.3.6.1.4.1.14519.5.2.1.6279.6001.657775098760536289051744981056\51
C:\Users\P70065719\Desktop\Thesis\CROWD_TEST\Negative_samples\seg_op\subset1\1.3.6.1.4.1.14519.5.2.1.6279.6001.206539885154775002929031534291\124
C:\Users\P70065719\Desktop\Thesis\CROWD_TEST\Negative_samples\seg_op\subset1\1.3.6.1.4.1.14519.5.2.1.6279.6001.250397690690072950000431855143\73
C:\Users\P70065719\Desktop\Thesis\CROWD_TEST\Negative_samples\seg_op\subset1\1.3.6.1.4.1.14519.5.2.1.6279.6001.888291896309937415860209787179\85
C:\Users\P70065719\Desktop\Thesis\CROWD_TEST\Negative_samples\seg_op\subset2\1.3.6.1.4.1.14519.5.2.1.6279.6001.121993590721161347818774929286\102
C:\Users\P70065719\Desktop\Thesis\CROWD_TEST\Negative_samples\seg_op\subset3\1.3.6.1.4.1.14519.5.2.1.6279.6001.10095348302819217

__Rename original files__ 

The following cells were executed to rename the files from the original dataset. Because, we would like to give corresponding names to the 5-files in a sequence for convenience. This is done before the annotation. It will help us trace back if necessary.  

In [5]:
## Renaming original files - PART 1
export = []
index = 0
for p in png_files: 
    index = index + 1
    if index == 6:
        index = 1
    path, filename = os.path.split(p)
    export.append([p, filename, '_' + str(index) + '.png'])
    
for x in export:
    print(x)
    

['C:\\Users\\P70065719\\Desktop\\Thesis\\CROWD_TEST\\Negative_samples\\seg_op\\subset0\\1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369979066736354549484\\65\\_1.png', '_1.png', '_1.png']
['C:\\Users\\P70065719\\Desktop\\Thesis\\CROWD_TEST\\Negative_samples\\seg_op\\subset0\\1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369979066736354549484\\65\\_2.png', '_2.png', '_2.png']
['C:\\Users\\P70065719\\Desktop\\Thesis\\CROWD_TEST\\Negative_samples\\seg_op\\subset0\\1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369979066736354549484\\65\\_3.png', '_3.png', '_3.png']
['C:\\Users\\P70065719\\Desktop\\Thesis\\CROWD_TEST\\Negative_samples\\seg_op\\subset0\\1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369979066736354549484\\65\\_4.png', '_4.png', '_4.png']
['C:\\Users\\P70065719\\Desktop\\Thesis\\CROWD_TEST\\Negative_samples\\seg_op\\subset0\\1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369979066736354549484\\65\\_5.png', '_5.png', '_5.png']
['C:\\Users\\P70065719\\Desktop\\Thesis\\CROWD_TEST\\Negative_sam

In [6]:
## Renaming original files - PART 2
import csv 
with open('file_rename.csv', 'w', newline='') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for x in export: 
        wr.writerow(x)
            
print('done')

done


In [7]:
## Renaming original files - PART 3
for x in export:
    os.rename(x[0], x[0].replace(x[1], x[2]))
    #print(x[0], x[0].replace(x[1], x[2]))
print('Done!')
#print(path)
#print(filename)


Done!


__Ready tof the annotation__

At this point, the files are renamed and prepared for the annotation task. 

## Gold Standard

In this part, we load the gold standard annotations from the LUNA annotation spreadsheets. 

Important things to consider are the 'seriesuid' (patient id), 'diameter_mm' (tumor size), 'Xspac', 'Xnpy', 'Ynpy', 'slicenumber'

In [8]:
import pandas as pd 

gold_std = pd.read_excel('new_annotations_LUNA.xlsx')
gold_std.head()
gold_std_relevant = gold_std[['seriesuid', 'diameter_mm', 'Xspac', 'Xnpy', 'Ynpy', 'slicenumber']]
#for index, row in gold_std_relevant.iterrows():
#    print(row['seriesuid'])
gold_std_relevant

Unnamed: 0,seriesuid,diameter_mm,Xspac,Xnpy,Ynpy,slicenumber
0,1.3.6.1.4.1.14519.5.2.1.6279.6001.534083630500...,5.965580,0.654297,113.865362,342.003187,151
1,1.3.6.1.4.1.14519.5.2.1.6279.6001.534083630500...,5.090964,0.654297,181.711839,250.958298,149
2,1.3.6.1.4.1.14519.5.2.1.6279.6001.275755514659...,7.663782,0.761719,140.049968,215.827454,264
3,1.3.6.1.4.1.14519.5.2.1.6279.6001.332829333783...,5.264828,0.816406,237.935511,366.272552,164
4,1.3.6.1.4.1.14519.5.2.1.6279.6001.187966156856...,4.073946,0.664062,375.606031,231.577142,316
...,...,...,...,...,...,...
1167,1.3.6.1.4.1.14519.5.2.1.6279.6001.213140617640...,13.962632,0.617188,318.247500,184.925110,199
1168,1.3.6.1.4.1.14519.5.2.1.6279.6001.290135156874...,18.998648,0.703125,169.257897,395.375280,68
1169,1.3.6.1.4.1.14519.5.2.1.6279.6001.312127933722...,9.484890,0.605469,287.062047,218.740562,131
1170,1.3.6.1.4.1.14519.5.2.1.6279.6001.219349715895...,10.378710,0.761719,346.914445,160.825413,101


In [9]:
# Get the gold standard data of a particular patient (seriesuid) on a particular slice (slicenumber) as a DataFrame.
# This function could return a DataFrame having zero, one, or multiple rows.
def getGoldData(patient_id, gold_std_relevant, layer=0):
    gold_std_relevant['slicenumber'] = gold_std_relevant['slicenumber'].astype(int)
    if int(layer) > 0:
        dfRet = gold_std_relevant.loc[(gold_std_relevant['seriesuid'] == patient_id) & (abs(gold_std_relevant['slicenumber'] - int(layer)) < 3)]
    else:
        dfRet = gold_std_relevant.loc[gold_std_relevant['seriesuid'] == patient_id]
    #print(dfRet)
    return dfRet

# Given a filePath as string, this function identifies the seriesuid and slicenumber inside the path. 
# IMPORTANT NOTE TO AAKANKSHA: The for loop turned out to be important because the seriesuid does not always appear at the 3rd split-element from the end. :) 
def getPatientIdFromFilePath(filePath):
    retF = ''
    retLayer = ''
    spltf = filePath.split('\\')
    for spi in spltf:
        if '1.3' in spi:
            retF = spi
    #retF = spltf[-3]  # This does not work in some cases! 
    retLayer = spltf[-2]
    return retF, retLayer


In [10]:
# An example of using the the getGoldData function. Here, we do not set the optional parameter layer. 
# It means, tumors on all layers will be returned by the function as rows.
getGoldData('1.3.6.1.4.1.14519.5.2.1.6279.6001.387954549120924524005910602207',gold_std_relevant)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Unnamed: 0,seriesuid,diameter_mm,Xspac,Xnpy,Ynpy,slicenumber
240,1.3.6.1.4.1.14519.5.2.1.6279.6001.387954549120...,4.945113,0.546875,79.15619,279.706132,170
421,1.3.6.1.4.1.14519.5.2.1.6279.6001.387954549120...,4.528739,0.546875,323.558215,263.566154,108


In [11]:
# In this example, we also set the layer parameter. It will give us the =- 3 slices from the layer we set. 
getGoldData('1.3.6.1.4.1.14519.5.2.1.6279.6001.387954549120924524005910602207',gold_std_relevant, 172)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


Unnamed: 0,seriesuid,diameter_mm,Xspac,Xnpy,Ynpy,slicenumber
240,1.3.6.1.4.1.14519.5.2.1.6279.6001.387954549120...,4.945113,0.546875,79.15619,279.706132,170


In [12]:
# An example of how to select the gold data of a particular series.

# Next line does not work for some reason. Instead use the one below that. 
#gold_std_relevant.loc[gold_std_relevant['seriesuid'] == '1.3.6.1.4.1.14519.5.2.1.6279.6001.387954549120924524005910602207']

gold_std_relevant[gold_std_relevant.seriesuid.str.contains('1.3.6.1.4.1.14519.5.2.1.6279.6001',case=False)]

Unnamed: 0,seriesuid,diameter_mm,Xspac,Xnpy,Ynpy,slicenumber
0,1.3.6.1.4.1.14519.5.2.1.6279.6001.534083630500...,5.965580,0.654297,113.865362,342.003187,151
1,1.3.6.1.4.1.14519.5.2.1.6279.6001.534083630500...,5.090964,0.654297,181.711839,250.958298,149
2,1.3.6.1.4.1.14519.5.2.1.6279.6001.275755514659...,7.663782,0.761719,140.049968,215.827454,264
3,1.3.6.1.4.1.14519.5.2.1.6279.6001.332829333783...,5.264828,0.816406,237.935511,366.272552,164
4,1.3.6.1.4.1.14519.5.2.1.6279.6001.187966156856...,4.073946,0.664062,375.606031,231.577142,316
...,...,...,...,...,...,...
1167,1.3.6.1.4.1.14519.5.2.1.6279.6001.213140617640...,13.962632,0.617188,318.247500,184.925110,199
1168,1.3.6.1.4.1.14519.5.2.1.6279.6001.290135156874...,18.998648,0.703125,169.257897,395.375280,68
1169,1.3.6.1.4.1.14519.5.2.1.6279.6001.312127933722...,9.484890,0.605469,287.062047,218.740562,131
1170,1.3.6.1.4.1.14519.5.2.1.6279.6001.219349715895...,10.378710,0.761719,346.914445,160.825413,101


In [13]:
# In which range do the gold coordinates change? Apparently, it changes between 17 and 488. 
# It makes sense because the images are 512 by 512 in size. 
gold_std_relevant['Xnpy'].describe()

count    1172.000000
mean      249.471393
std       122.217838
min        17.884711
25%       140.023203
50%       211.251631
75%       367.294984
max       488.890586
Name: Xnpy, dtype: float64

In [14]:
# Get the gold data for all files in a folder. 
flist = getListOfFiles(basedir)

for i in flist:
    f, l = getPatientIdFromFilePath(i)
    print(getGoldData(f,gold_std_relevant))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


                                             seriesuid  diameter_mm     Xspac  \
684  1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369...     5.063233  0.585938   

           Xnpy        Ynpy  slicenumber  
684  436.693401  346.054771           70  
                                             seriesuid  diameter_mm     Xspac  \
684  1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369...     5.063233  0.585938   

           Xnpy        Ynpy  slicenumber  
684  436.693401  346.054771           70  
                                             seriesuid  diameter_mm     Xspac  \
684  1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369...     5.063233  0.585938   

           Xnpy        Ynpy  slicenumber  
684  436.693401  346.054771           70  
                                             seriesuid  diameter_mm     Xspac  \
684  1.3.6.1.4.1.14519.5.2.1.6279.6001.404364125369...     5.063233  0.585938   

           Xnpy        Ynpy  slicenumber  
684  436.693401  346.054771           70  
    

Index: []
Empty DataFrame
Columns: [seriesuid, diameter_mm, Xspac, Xnpy, Ynpy, slicenumber]
Index: []
Empty DataFrame
Columns: [seriesuid, diameter_mm, Xspac, Xnpy, Ynpy, slicenumber]
Index: []
Empty DataFrame
Columns: [seriesuid, diameter_mm, Xspac, Xnpy, Ynpy, slicenumber]
Index: []
Empty DataFrame
Columns: [seriesuid, diameter_mm, Xspac, Xnpy, Ynpy, slicenumber]
Index: []
Empty DataFrame
Columns: [seriesuid, diameter_mm, Xspac, Xnpy, Ynpy, slicenumber]
Index: []
                                            seriesuid  diameter_mm     Xspac  \
48  1.3.6.1.4.1.14519.5.2.1.6279.6001.226889213794...     5.244469  0.742188   

         Xnpy        Ynpy  slicenumber  
48  343.84034  397.050149           42  
                                            seriesuid  diameter_mm     Xspac  \
48  1.3.6.1.4.1.14519.5.2.1.6279.6001.226889213794...     5.244469  0.742188   

         Xnpy        Ynpy  slicenumber  
48  343.84034  397.050149           42  
                                            

                                             seriesuid  diameter_mm     Xspac  \
778  1.3.6.1.4.1.14519.5.2.1.6279.6001.206539885154...     7.748455  0.703125   
820  1.3.6.1.4.1.14519.5.2.1.6279.6001.206539885154...     5.071227  0.703125   

           Xnpy        Ynpy  slicenumber  
778  452.679367  272.900243           45  
820  462.581321  245.180584           71  
                                             seriesuid  diameter_mm     Xspac  \
778  1.3.6.1.4.1.14519.5.2.1.6279.6001.206539885154...     7.748455  0.703125   
820  1.3.6.1.4.1.14519.5.2.1.6279.6001.206539885154...     5.071227  0.703125   

           Xnpy        Ynpy  slicenumber  
778  452.679367  272.900243           45  
820  462.581321  245.180584           71  
                                             seriesuid  diameter_mm     Xspac  \
778  1.3.6.1.4.1.14519.5.2.1.6279.6001.206539885154...     7.748455  0.703125   
820  1.3.6.1.4.1.14519.5.2.1.6279.6001.206539885154...     5.071227  0.703125   

          

1132  193.89157  290.153066          108  
                                              seriesuid  diameter_mm    Xspac  \
1132  1.3.6.1.4.1.14519.5.2.1.6279.6001.272348349298...    22.880792  0.78125   

           Xnpy        Ynpy  slicenumber  
1132  193.89157  290.153066          108  
                                             seriesuid  diameter_mm     Xspac  \
526  1.3.6.1.4.1.14519.5.2.1.6279.6001.280125803152...     5.905236  0.703125   
625  1.3.6.1.4.1.14519.5.2.1.6279.6001.280125803152...     5.597494  0.703125   
728  1.3.6.1.4.1.14519.5.2.1.6279.6001.280125803152...     6.844315  0.703125   

           Xnpy        Ynpy  slicenumber  
526  288.268210  366.239927           79  
625  301.058107  347.720882           42  
728  177.602170  267.089272           49  
                                             seriesuid  diameter_mm     Xspac  \
526  1.3.6.1.4.1.14519.5.2.1.6279.6001.280125803152...     5.905236  0.703125   
625  1.3.6.1.4.1.14519.5.2.1.6279.6001.2801258031

863  319.844656  333.755735           21  
                                              seriesuid  diameter_mm  \
85    1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     4.762203   
343   1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     4.181188   
357   1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     5.209168   
386   1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     5.062185   
632   1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     8.679782   
1124  1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...    23.483745   

         Xspac        Xnpy        Ynpy  slicenumber  
85    0.703125  386.080098  256.565591           76  
343   0.703125  314.358005  205.816883           99  
357   0.703125  164.781306  165.693238           90  
386   0.703125  349.596155  214.298994          169  
632   0.703125  376.297183  289.029245          158  
1124  0.703125  334.177558  207.219116          144  
                                              seriesuid  diameter_mm  \
85

                                              seriesuid  diameter_mm  \
633   1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...     8.688066   
862   1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...    13.057559   
1004  1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...    12.764006   

         Xspac        Xnpy        Ynpy  slicenumber  
633   0.703125  343.071348  359.862986          238  
862   0.703125  385.875372  319.518180          189  
1004  0.703125  359.782905  300.192890          206  
                                              seriesuid  diameter_mm  \
633   1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...     8.688066   
862   1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...    13.057559   
1004  1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...    12.764006   

         Xspac        Xnpy        Ynpy  slicenumber  
633   0.703125  343.071348  359.862986          238  
862   0.703125  385.875372  319.518180          189  
1004  0.703125  359.782905  300.192890      

1124  0.703125  334.177558  207.219116          144  
                                              seriesuid  diameter_mm  \
85    1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     4.762203   
343   1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     4.181188   
357   1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     5.209168   
386   1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     5.062185   
632   1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...     8.679782   
1124  1.3.6.1.4.1.14519.5.2.1.6279.6001.148447286464...    23.483745   

         Xspac        Xnpy        Ynpy  slicenumber  
85    0.703125  386.080098  256.565591           76  
343   0.703125  314.358005  205.816883           99  
357   0.703125  164.781306  165.693238           90  
386   0.703125  349.596155  214.298994          169  
632   0.703125  376.297183  289.029245          158  
1124  0.703125  334.177558  207.219116          144  
                                              seriesuid  diamet

1004  0.703125  359.782905  300.192890          206  
                                              seriesuid  diameter_mm  \
633   1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...     8.688066   
862   1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...    13.057559   
1004  1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...    12.764006   

         Xspac        Xnpy        Ynpy  slicenumber  
633   0.703125  343.071348  359.862986          238  
862   0.703125  385.875372  319.518180          189  
1004  0.703125  359.782905  300.192890          206  
                                              seriesuid  diameter_mm  \
633   1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...     8.688066   
862   1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...    13.057559   
1004  1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687...    12.764006   

         Xspac        Xnpy        Ynpy  slicenumber  
633   0.703125  343.071348  359.862986          238  
862   0.703125  385.875372  319.518180      

### Display gold annotation on an image

Using CAROU image libraries, we display the annotations on an image. 

In [49]:
#Draw points on the image using PIL libraries.Cant find carou package anywhere


from PIL import Image, ImageDraw
 
def annotatePointsOnImage(img,pointsList,gold):
#     img = Image.new("RGB", (600,400), bg)
#     w, h = img.size
    draw = ImageDraw.Draw(img)
    if not gold:
        for p in pointsList:
            draw.line([(p[0],p[1]),(p[0]+1,p[1])],fill='red',width=4)
    else :
        for p in pointsList:
            draw.line([(p[0],p[1]),(p[0]+1,p[1])],fill='lime',width=4)
    return img

In [16]:
# from carou.annotation.spatial import display
import matplotlib.image as mpimg
import math

## Set the file index here to change the gold annotation. 
fileListIndex = 288

patId, patLay = getPatientIdFromFilePath(flist[fileListIndex])
# img = mpimg.imread(flist[fileListIndex])
img=Image.open(open(flist[fileListIndex], 'rb'))
dfPat = getGoldData(patId, gold_std_relevant, int(patLay))
dfPat.head()
print(flist[fileListIndex])
pointList = []
for index, row in dfPat.iterrows():
    x = int(math.floor(row['Xnpy']))
    y = int(math.floor(row['Ynpy']))
    pointList.append([x,y])
# display.displayImageWithAnnotations(img, pointList=pointList ,displayFigure=False)
img=annotatePointsOnImage(img,pointList)
img.show()

C:\Users\P70065719\Desktop\Thesis\CROWD_TEST\Positive_samples\segmented\segmented_subset8\1.3.6.1.4.1.14519.5.2.1.6279.6001.301462380687644451483231621986\237\_4.png


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


TypeError: annotatePointsOnImage() missing 1 required positional argument: 'gold'

## Annotations

Annotations are recorded in separate JSON files using the following structure: 

        "id": "5dcacec4dc861",
        "annotator": "banu",
        "url": "input\/CROWD_TEST\\Negative_samples\\seg_op\\subset1\\1.3.6.1.4.1.14519.5.2.1.6279.6001.206539885154775002929031534291\\124\\.png",
        "timestamp": "2019-11-12 16:24:52",
        "POIs": "[{\"x\":101,\"y\":444}]",
        "Polygons": "[]",
        "FreeDots": "[]",
        "Boxes": "[]",
        "DisplayTime": "1573572287386",
        "SubmitTime": "1573572292899"

In [17]:
## Check the annotation folder for JSON files.
import os

# Set the directory to the UnsegmentedSequence-JSON folder. Change this for analyzing the results of other experiments as well.
jsondir = 'C:\\Users\\P70065719\\Desktop\\CrowdExperimentResults\\DATA\\UnsegmentedSequence\\json\\'

l_files = getListOfFiles(jsondir)

data_files = []
for i in l_files:
    if 'data_' in i: # all annotation files use the nameing such as data_[WORKERID].json
        data_files.append(i)

print(len(data_files), ' json files were found')

41  json files were found


In [19]:
# We get the overall descriptives from all annotation files in the folder. 
# Some of them are expcted to contain missing annotations and all other sorts of weird things.  So better to check them before the analysis! 
# LESSONS LEARNED: People double click on web page buttons for some reason!
import json

desc_list = []

# traverse the data_files list that contains the list of all JSON files. 
for data_file in data_files:
    #print(data_file)
    annotator = ''
    no_of_anno = 0
    no_of_positive_anno = 0
    unique_url_list = []
    total_task_duration = 0
    if '.json' in data_file:
        with open(data_file) as json_file:
            data = json.load(json_file)
            # for each element (annotation instance) in the JSON dictionary.
            for p in data:
                no_of_anno = no_of_anno + 1
                coords = dict()
                annotator = p['annotator']
                if p['url'] not in unique_url_list:
                    unique_url_list.append(p['url'])
                    task_duration = math.floor(int(p['SubmitTime']) - int(p['DisplayTime'])) / 1000
                    total_task_duration = total_task_duration + task_duration
                if len(p['POIs']) > 5:
                    coords = eval(p['POIs'])[0]
                    x, y = coords['x'], coords['y']
                    no_of_positive_anno = no_of_positive_anno + 1
                else: 
                    x, y = -1, -1
            desc_list.append([annotator, no_of_anno, no_of_positive_anno, len(unique_url_list), math.floor(total_task_duration / len(unique_url_list))])

## desc_list contains a number of columns. 
## annotator: the user name of the annotator.
## no_of_anno: number of annotations submitted by the annotator. Sometimes the user submits the same data twice. 
##             This is either a system glitch on the server side or just a user doubleclicking the submit button.
## no_of_positive_anno: number of annotations in which a user indicated a tumor. Ideally this should be half of the dataset size.
## len(unique_url_list): This is the number of files actually annotated by workers. It should be the same as the dataset size, ideally.
## The last colum is the average time to complete a task in seconds. It should be more than 10 and smaller than 100, for example... 

print(desc_list)                

[['A10Q4U3BRHXXPP', 46, 46, 46, 17], ['A11318F9PB5FFY', 46, 40, 46, 10], ['A13G6IRFQBEE8K', 1, 1, 1, 43], ['A153J31AVDX32V', 49, 49, 46, 24], ['A1DUICEAQJVEQU', 46, 36, 46, 7], ['A1EIMXVUOG49SJ', 49, 30, 46, 12], ['A1GD2Z9370OBNG', 21, 20, 21, 8], ['A1GKD3NG1NNHRP', 46, 27, 46, 14], ['A1IG876U4QFF81', 1, 0, 1, 0], ['A1YCZENBDZ5GGZ', 29, 29, 29, 4], ['A23PQYQ6A2I076', 48, 42, 46, 18], ['A2526X03E9SRLI', 54, 13, 46, 12], ['A27ODHA3747UVP', 5, 5, 5, 19], ['A2A0MSLKF3ERF4', 46, 46, 46, 11], ['A2CYXHEA1EX07O', 46, 46, 46, 13], ['A2J76WH59IRPRE', 6, 6, 6, 28], ['A2M1CVZZJAN4T4', 46, 43, 46, 43], ['A2PU5BM2YXFQQS', 4, 4, 4, 7], ['A2V84QVZRULWF1', 1, 0, 1, 0], ['A30HUZHJBOX1LK', 9, 7, 9, 19], ['A31XFBQITA3FAP', 46, 43, 46, 14], ['A34B8T4EZZUZFV', 2, 2, 2, 14], ['A3AJE1ORZMFDON', 1, 1, 1, 64], ['A3AK3UL0UCNVKE', 5, 3, 5, 9], ['A3F8UT6178B2A4', 46, 30, 46, 20], ['A3L8H2MPUNI45Q', 3, 3, 3, 22], ['A3P04JZJNOYILR', 11, 6, 10, 6], ['A3UF6XXFFRR237', 56, 55, 46, 28], ['A90G0G4SJ26BM', 7, 6, 7, 18], [

In [20]:
# HERE WE ELIMINATE THE LOW QUALITY / INCOMPLETE ANNOTATIONS. 

## desc_list contains a number of columns. 
## annotator: the user name of the annotator.
## no_of_anno: number of annotations submitted by the annotator. Sometimes the user submits the same data twice. 
##             This is either a system glitch on the server side or just a user doubleclicking the submit button.
## no_of_positive_anno: number of annotations in which a user indicated a tumor. Ideally this should be half of the dataset size.
## len(unique_url_list): This is the number of files actually annotated by workers. It should be the same as the dataset size, ideally.
## The last colum is the average time to complete a task in seconds. It should be more than 10 and smaller than 100, for example... 

seemingly_valid_annotator_list = []
for i in desc_list:
    if i[1]>40 and i[2]>10 and i[3]>40 and i[4]>12:
        seemingly_valid_annotator_list.append(i[0])
        print(i)
        
len(seemingly_valid_annotator_list)

['A10Q4U3BRHXXPP', 46, 46, 46, 17]
['A153J31AVDX32V', 49, 49, 46, 24]
['A1GKD3NG1NNHRP', 46, 27, 46, 14]
['A23PQYQ6A2I076', 48, 42, 46, 18]
['A2CYXHEA1EX07O', 46, 46, 46, 13]
['A2M1CVZZJAN4T4', 46, 43, 46, 43]
['A31XFBQITA3FAP', 46, 43, 46, 14]
['A3F8UT6178B2A4', 46, 30, 46, 20]
['A3UF6XXFFRR237', 56, 55, 46, 28]
['ACI8PUCF5OPDC', 46, 40, 46, 15]
['AWENQ6RS7ABZ6', 46, 23, 46, 25]


11

In [21]:
## NOW WE LOAD THE COORDINATES FROM ANNOTATION LOGS


basedir = jsondir
data_points = []
for i in seemingly_valid_annotator_list:
    json_path = 'data_' + i + '.json'
    unique_url = []
    with open(basedir + json_path) as json_file:
        data = json.load(json_file)
        for dp in data:
            if dp['url'] not in unique_url:
                unique_url.append(dp['url'])
                
                coords = dict()
                annotator = dp['annotator']
                
                # If there is no POI indicated, we mark the coordinate as x,y = -1, -1.  
                if len(dp['POIs']) > 5:
                    coords = eval(dp['POIs'])[0]
                    x, y = coords['x'], coords['y']
                    no_of_positive_anno = no_of_positive_anno + 1
                else: 
                    x, y = -1, -1
                
                task_duration = math.floor(int(dp['SubmitTime']) - int(dp['DisplayTime'])) / 1000
                data_points.append([json_path, dp['url'], annotator, x, y, task_duration])


In [22]:
## This is how the data looks like. 
dfData = pd.DataFrame.from_records(data_points, columns=['annotationfile', 'image', 'annotator', 'x', 'y', 'duration']) 
dfData.head()


Unnamed: 0,annotationfile,image,annotator,x,y,duration
0,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Negative_samples\unseg_op\sub...,A10Q4U3BRHXXPP,601,284,27.996
1,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Positive_samples\unsegmented\...,A10Q4U3BRHXXPP,261,271,39.79
2,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Negative_samples\unseg_op\sub...,A10Q4U3BRHXXPP,201,262,7.104
3,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Positive_samples\unsegmented\...,A10Q4U3BRHXXPP,505,289,8.345
4,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Negative_samples\unseg_op\sub...,A10Q4U3BRHXXPP,512,559,23.402


In [2]:
## IMPORTANT - ANNOTATION SCALE NORMALIZATION
# The LUNA dataset contains 512 by 512 images. The gold standard coordinates are also in the same coordinate system (512 by 512).
# However, we collected the data using 800 by 800 images during crowdsourcing. So, we need to convert the coordinate scales.  

# The new 'xy' column is going to contain the normalized coordinates!
for index, row in dfData.iterrows():
    dfData.loc[index,'xy'] = str(math.floor(int(row['x']) * 512/800) ) + "," + str(math.floor(int(row['y']) * 512/800))

NameError: name 'dfData' is not defined

In [24]:
dfData.head()

Unnamed: 0,annotationfile,image,annotator,x,y,duration,xy
0,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Negative_samples\unseg_op\sub...,A10Q4U3BRHXXPP,601,284,27.996,384181
1,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Positive_samples\unsegmented\...,A10Q4U3BRHXXPP,261,271,39.79,167173
2,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Negative_samples\unseg_op\sub...,A10Q4U3BRHXXPP,201,262,7.104,128167
3,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Positive_samples\unsegmented\...,A10Q4U3BRHXXPP,505,289,8.345,323184
4,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Negative_samples\unseg_op\sub...,A10Q4U3BRHXXPP,512,559,23.402,327357


In [1]:
## PIVOT THE DATA so that we can have individual annotators as rows, and images as columns, and values as the normalized annotation corrdinates. 
dfPivotedData = dfData.pivot(index='annotator', columns='image', values=['xy'])
dfPivotedData.head()

NameError: name 'dfData' is not defined

In [26]:
listOfPoints=[]

In [50]:
## TO DISPLAY AN ANNOTATED IMAGE
for i in range(0,41):
    col_index =i
    print(col_index)
#dfPivotedData.xy.columns[col_index]

## We select the 3rd image in the sequence of 5 to display the annotations on.
    fname = dfPivotedData.xy.columns[col_index].replace('.png', '_3.png').replace('/','\\')
#     print(fname)
    basedir = 'C:\\Users\\P70065719\\Desktop\\Thesis\\'
    fname = basedir + fname[6:]
#     print(fname)

# img = mpimg.imread(fname)
    img=Image.open(open(fname, 'rb'))
    listOfDots = []
    for item in range(0,len(dfPivotedData.xy[dfPivotedData.xy.columns[col_index]])):
        listOfDots.append(dfPivotedData.xy[dfPivotedData.xy.columns[col_index]][item].split(','))

    
    patId, patLay = getPatientIdFromFilePath(fname)

    dfPat = getGoldData(patId, gold_std_relevant, int(patLay))
    goldpointList = []
    for index, row in dfPat.iterrows():
        x = int(math.floor(row['Xnpy']))
        y = int(math.floor(row['Ynpy']))
        goldpointList.append([x,y])
## No normalization is required because we already did that in previous cells. But I will keep the code here anyway.
## It is currently just converting string coordinates to integers. 
    normalizedDots = []
    for i in listOfDots:
        if int(i[0]) > 0: 
            x = math.floor(int(i[0]))
            y = math.floor(int(i[1]))
            normalizedDots.append([x,y])
#     print(len(normalizedDots))
    listOfPoints.append(normalizedDots)
# display.displayImageWithAnnotations(img, pointList=normalizedDots ,displayFigure=False)
    base_img=img
#     base_img=base_img.save('base\\'+str(col_index)+".png")
#     img=annotatePointsOnImage(img,goldpointList,1)
#     base_img=base_img.save('base\\'+str(col_index)+".png")
    img=annotatePointsOnImage(img,normalizedDots,0)
#     img=annotatePointsOnImage(img,goldpointList,1)
    img=annotatePointsOnImage(img,goldpointList,1)

    print(col_index)
    img=img.save('annotated\\'+str(col_index)+".png")
#     base_img=base_img.save('base\\'+str(col_index)+".png")

0
0
1


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


1
2
2
3
3
4
4
5
5
6
6
7
7
8
8
9
9
10
10
11
11
12
12
13
13
14
14
15
15
16
16
17
17
18
18
19
19
20
20
21
21
22
22
23
23
24
24
25
25
26
26
27
27
28
28
29
29
30
30
31
31
32
32
33
33
34
34
35
35
36
36
37
37
38
38
39
39
40
40


In [28]:
cluster_centroids=[]
print(len(listOfPoints))

41


In [29]:
# from sklearn.cluster import KMeans

# for i in range(len(listOfPoints)):
#     Kmean = KMeans(n_clusters=1)
#     Kmean.fit(listOfPoints[i])
#     center=Kmean.cluster_centers_[0]
#     cluster_centroids.append([int(center[0]),int(center[1])])
cluster_labels=[]
from sklearn.cluster import DBSCAN
for i in range(len(listOfPoints)):
    
    clustering = DBSCAN(eps=5, min_samples=2).fit(listOfPoints[i])
    labels=clustering.labels_
    cluster_labels.append(labels.tolist())


In [30]:
for i in range(len(cluster_labels)):
    print(i,cluster_labels[i])


0 [-1, -1, 0, -1, -1, -1, -1, 0]
1 [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]
2 [-1, 0, -1, 0, -1, -1]
3 [-1, -1, -1, 0, -1, 0, -1, -1, -1]
4 [0, -1, -1, -1, 0, 0, -1, -1]
5 [-1, -1, -1, 0, -1, 0, -1, -1, 0]
6 [-1, -1, 0, -1, -1, 0, 0, 0]
7 [0, -1, 1, -1, 1, 1, -1, 1, 0, -1]
8 [-1, -1, -1, -1, 0, -1, 0, -1, -1]
9 [-1, -1, 0, -1, 1, 0, 1, 0, 0, 0, 0]
10 [-1, 0, 1, 0, 1, 2, 2, -1, -1]
11 [-1, -1, -1, 0, -1, 0, -1, -1]
12 [-1, -1, 0, -1, -1, 0, 0, -1]
13 [-1, 0, 1, -1, -1, 1, 0, -1]
14 [0, -1, 1, -1, 0, -1, -1, 1, 0]
15 [-1, -1, -1, -1, -1, 0, -1, -1, 0]
16 [-1, -1, -1, -1, -1]
17 [-1, -1, -1, -1, -1, 0, -1, 0, -1]
18 [0, -1, -1, -1, -1, 0, -1]
19 [0, -1, 1, -1, 1, 1, 1, 1, 0, -1]
20 [0, 0, -1, -1, 0, -1, -1, 0, 0, 0]
21 [0, 0, 0, 0, -1, 0, 0, -1]
22 [0, -1, 0, -1, 0, 1, 0, 0, 1, 0]
23 [0, -1, -1, -1, -1, 0, 0, 0, -1, -1]
24 [0, -1, 0, -1, -1, 0, -1, 0, 0, -1, 0]
25 [0, -1, -1, -1, 0, -1, 0, 0, -1, 0]
26 [0, 0, 0, -1, -1, 0, -1, 0, -1, 0, 0]
27 [0, -1, -1, 1, 1, -1, 0, 0, 0]
28 [0, -1, 

In [31]:
listOfPoints

[[[238, 157],
  [188, 318],
  [188, 309],
  [193, 299],
  [215, 274],
  [117, 258],
  [80, 191],
  [191, 308]],
 [[137, 238],
  [165, 328],
  [132, 229],
  [340, 318],
  [171, 322],
  [222, 286],
  [358, 296],
  [151, 205],
  [138, 218],
  [140, 204],
  [391, 279]],
 [[294, 138], [362, 229], [347, 181], [364, 227], [368, 231], [182, 176]],
 [[140, 73],
  [410, 203],
  [182, 320],
  [193, 317],
  [398, 228],
  [197, 316],
  [353, 270],
  [358, 273],
  [385, 325]],
 [[402, 261],
  [382, 359],
  [158, 236],
  [172, 353],
  [396, 264],
  [399, 260],
  [389, 268],
  [206, 295]],
 [[224, 249],
  [362, 220],
  [211, 270],
  [161, 279],
  [332, 295],
  [160, 284],
  [307, 336],
  [184, 301],
  [158, 280]],
 [[30, 305],
  [195, 300],
  [202, 300],
  [179, 309],
  [187, 334],
  [202, 300],
  [201, 297],
  [203, 301]],
 [[336, 256],
  [343, 206],
  [190, 220],
  [205, 253],
  [188, 218],
  [190, 218],
  [197, 215],
  [191, 220],
  [336, 252],
  [209, 247]],
 [[184, 318],
  [186, 313],
  [169, 243

In [32]:
len(cluster_labels)

41

In [33]:
# j=0
# for labels in cluster_labels:
#     img=Image.open('annotated\\'+str(j)+'.png')
#     distint_labels=set(labels)
#     for val in distint_labels:
#         if val != -1:
#             indices = [i for i, x in enumerate(distint_labels) if x == val]
#             center_x=0
#             center_y=0
#             for i in indices:
#                 print(listOfPoints[j][i][0],listOfPoints[j][i][1])
#                 center_x=center_x+listOfPoints[j][i][0]
#                 center_y=center_y+listOfPoints[j][i][1]
#             center_x=int(center_x/(len(indices)))
#             center_y=int(center_y/(len(indices)))
#             print(center_x,center_y)
#             draw = ImageDraw.Draw(img)
#             draw.line([(center_x,center_y),(center_x+1,center_y)],fill='red',width=4)
#             img=img.save('annotated\\'+str(j)+'.png')
#     j=j+1
color=['yellow','blue','green','darkred']            
        
j=0
for label in cluster_labels:
    img=Image.open('base\\'+str(j)+'.png')
    draw=ImageDraw.Draw(img)
    distinct_clusters=set(label)
    print(j,distinct_clusters)
    k=0
    for val in distinct_clusters:
        if val!=-1:
            indices=[i for i, x in enumerate(label) if x == val]
            print(indices)
            center_x=0
            center_y=0
            for i in indices:
                center_x=center_x+listOfPoints[j][i][0]
                center_y=center_y+listOfPoints[j][i][1]
            center_x=int(center_x/len(indices))
            center_y=int(center_y/len(indices))
#                 p=listOfPoints[j][i]
#                 draw.line([(p[0],p[1]),(p[0]+1,p[1])],fill=color[k],width=4)
            draw.line([(center_x,center_y),(center_x+1,center_y)],fill='aqua',width=4)
        k=k+1
    img=img.save('DBSCAN_centers\\'+str(j)+'.png')
    print()
    j=j+1
    
            

0 {0, -1}
[2, 7]

1 {-1}

2 {0, -1}
[1, 3]

3 {0, -1}
[3, 5]

4 {0, -1}
[0, 4, 5]

5 {0, -1}
[3, 5, 8]

6 {0, -1}
[2, 5, 6, 7]

7 {0, 1, -1}
[0, 8]
[2, 4, 5, 7]

8 {0, -1}
[4, 6]

9 {0, 1, -1}
[2, 5, 7, 8, 9, 10]
[4, 6]

10 {0, 1, 2, -1}
[1, 3]
[2, 4]
[5, 6]

11 {0, -1}
[3, 5]

12 {0, -1}
[2, 5, 6]

13 {0, 1, -1}
[1, 6]
[2, 5]

14 {0, 1, -1}
[0, 4, 8]
[2, 7]

15 {0, -1}
[5, 8]

16 {-1}

17 {0, -1}
[5, 7]

18 {0, -1}
[0, 5]

19 {0, 1, -1}
[0, 8]
[2, 4, 5, 6, 7]

20 {0, -1}
[0, 1, 4, 7, 8, 9]

21 {0, -1}
[0, 1, 2, 3, 5, 6]

22 {0, 1, -1}
[0, 2, 4, 6, 7, 9]
[5, 8]

23 {0, -1}
[0, 5, 6, 7]

24 {0, -1}
[0, 2, 5, 7, 8, 10]

25 {0, -1}
[0, 4, 6, 7, 9]

26 {0, -1}
[0, 1, 2, 5, 7, 9, 10]

27 {0, 1, -1}
[0, 6, 7, 8]
[3, 4]

28 {0, 1, -1}
[0, 4, 6, 7, 9]
[3, 8]

29 {0, 1, -1}
[2, 4, 6, 8, 9]
[3, 5]

30 {0, 1, -1}
[3, 4, 5]
[6, 7, 9]

31 {0, 1, 2, -1}
[0, 8]
[5, 7]
[6, 9]

32 {0, -1}
[0, 3, 5, 6, 7]

33 {0, 1, -1}
[4, 9]
[5, 7, 10]

34 {0, 1, -1}
[0, 5]
[2, 4, 6, 7, 10]

35 {0, -1}
[0, 4, 6, 8]

3

In [34]:
## This is how we access the pivoted data frame columns.
dfPivotedData.xy.columns # gives all image names (because they are now the column names)
dfPivotedData.xy[dfPivotedData.xy.columns[0]][7] # Gives the value (coordinates) of a particular item in the pivoted table.

'-1,-1'

## Performance check

In [35]:
## dfData includes all annotations (POSITIVE and NEGATIVE images)
dfData.head()
print('The number of annotation points in the entire annotation set: ', len(dfData))


## dfDataPositives contain only the POSITIVE image annotations.  
dfDataPositives = dfData.loc[(dfData['image'].str.contains('Positive') )]
print('The number of annotation points don only on POSITIVE images: ',len(dfDataPositives))   
dfDataPositives.head()

The number of annotation points in the entire annotation set:  506
The number of annotation points don only on POSITIVE images:  231


Unnamed: 0,annotationfile,image,annotator,x,y,duration,xy
1,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Positive_samples\unsegmented\...,A10Q4U3BRHXXPP,261,271,39.79,167173
3,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Positive_samples\unsegmented\...,A10Q4U3BRHXXPP,505,289,8.345,323184
5,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Positive_samples\unsegmented\...,A10Q4U3BRHXXPP,634,466,6.587,405298
7,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Positive_samples\unsegmented\...,A10Q4U3BRHXXPP,620,281,25.089,396179
8,data_A10Q4U3BRHXXPP.json,input/CROWD_TEST\Positive_samples\unsegmented\...,A10Q4U3BRHXXPP,203,532,20.378,129340


In [36]:
from carou.annotation.spatial import calculation

# Check agreement of two points. x and y are crowd annotations, x_gold and y_gold are expert annotations from LUNA. 
def checkAgreement(x,y,x_gold,y_gold):
    if calculation.EuclideanDistance2D([x,y],[x_gold, y_gold]) < 50:
        return True
    else:
        return False

# The agreement outcome category. 
def categorizeAgreement(x,y,x_gold,y_gold,threshold=50):
    retType = ''
    if x_gold == -1:
        if x == -1:
            retType = 'TN'
        else: 
            retType = 'FP'
    else:
        if x == -1:
            retType = 'FN'
        else: 
            if calculation.EuclideanDistance2D([x,y],[x_gold, y_gold]) < threshold:
                retType = 'TP'
            else:
                retType = 'FP'
                
    return retType
    


ModuleNotFoundError: No module named 'carou'

In [37]:
## To control all individual annotations against the gold standard and assign outcome categories to each annotation.
listCalc = []
#dfDataPositives
for index, row in dfDataPositives.iterrows():
    if len(row['image'].split('\\'))>2:
        f, l = getPatientIdFromFilePath(row['image'])
        dfGoldOne = getGoldData(f,gold_std_relevant, l)
        gold_x = -1
        gold_y = -1
        if (len(dfGoldOne['Xnpy']) > 0) and (len(dfGoldOne['Ynpy']) > 0):
            gold_x = int(dfGoldOne.iloc[0]['Xnpy'])
            gold_y = int(dfGoldOne.iloc[0]['Ynpy'])
        listCalc.append([index, row['annotator'], row['xy'], gold_x, gold_y, 
              checkAgreement(int(row['xy'].split(',')[0]), int(row['xy'].split(',')[1]), gold_x, gold_y), 
              categorizeAgreement(int(row['xy'].split(',')[0]), int(row['xy'].split(',')[1]), gold_x, gold_y)])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  after removing the cwd from sys.path.


NameError: name 'checkAgreement' is not defined

In [38]:
## The performance of each annotator. 
dfCalc = pd.DataFrame.from_records(listCalc, columns=['index', 'annotator', 'xy', 'goldx', 'goldy', 'agreement', 'outcome']) 
dfCalc.head()

dfCalc.groupby('annotator')['outcome'].value_counts().unstack().fillna(0)

IndexError: boolean index did not match indexed array along dimension 0; dimension is 0 but corresponding boolean dimension is 1

In [None]:
## PRECISION, RECALL, TP-RATE
dfCalcGrouped = dfCalc.groupby('annotator')['outcome'].value_counts().unstack().fillna(0)

dfCalcGrouped['recall'] = dfCalcGrouped['TP'] / (dfCalcGrouped['TP'] + dfCalcGrouped['FN'])
dfCalcGrouped['precision'] = dfCalcGrouped['TP'] / (dfCalcGrouped['TP'] + dfCalcGrouped['FP'])
dfCalcGrouped['TP_Rate'] = dfCalcGrouped['TP'] / (dfCalcGrouped['TP'] + dfCalcGrouped['FP'] + dfCalcGrouped['FN'])
dfCalcGrouped

In [None]:
dfCalcGrouped = dfCalc.groupby('annotator')['outcome'].value_counts().unstack().fillna(0)

dfCalcGrouped['recall'] = dfCalcGrouped['TP'] / (dfCalcGrouped['TP'] + dfCalcGrouped['FN'])
dfCalcGrouped['precision'] = dfCalcGrouped['TP'] / (dfCalcGrouped['TP'] + dfCalcGrouped['FP'])
dfCalcGrouped['TP_Rate'] = dfCalcGrouped['TP'] / (dfCalcGrouped['TP'] + dfCalcGrouped['FP'] + dfCalcGrouped['TN'] + dfCalcGrouped['FN'])
dfCalcGrouped