# Preprocess human classification data downloaded from Zooniverse

In [1]:
%matplotlib inline
import pandas as pd
import json
import matplotlib.pyplot as plt
import numpy as np

## The human classification data

Two of the more important columns in the exported data are stored as JSON strings: the subject data and the annotations. The functions below extract the relevant fields from each. Unfortunately, the JSON layout seems to have changed over time, so the functions have to be flexible.

In [2]:
def ID_from_subject_data(subject_data):
    # given a string of JSON representing the subject data, extract and return the ID
    sd = json.loads(subject_data)
    sd1 = list(sd.values())[0]
    sd1k = sd1.keys()
    if 'ID' in sd1k:
        # easiest case: there's a subject ID in the JSON
        v = sd1['ID']
    elif 'Filename1' in sd1k:
        # otherwise, we can get it from the filename (I hope)
        v = sd1['Filename1'][3:13]
    else:
        # about 28 cases where the subject data are in some other format that this doesn't catch; these are ignored
        v = ''
        
    return v

In [3]:
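# an example (classification_data is read in a later cell; run that cell first, otherwise this raises the NameError shown below)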
ID_from_subject_data(classification_data['subject_data'][40000])

NameError: name 'classification_data' is not defined

_To do: check that this really works reliably, given the number of observations that don't link up._
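One way to start on that check is to tally which branch supplies each ID. The sketch below is not part of the original analysis: `subject_id_source` is a hypothetical helper that follows the same branch order as `ID_from_subject_data` (with extra guards for empty or non-dict payloads), and the sample strings are hand-written stand-ins rather than real export rows.

```python
import json
from collections import Counter

def subject_id_source(subject_data):
    """Return which branch would supply the ID: 'ID', 'Filename1', or 'none'.

    Hypothetical diagnostic helper; it follows the branch order of
    ID_from_subject_data and adds guards for empty or non-dict payloads.
    """
    sd = json.loads(subject_data)
    if not sd:
        return 'none'
    first = list(sd.values())[0]
    if not isinstance(first, dict):
        return 'none'
    if 'ID' in first:
        return 'ID'
    if 'Filename1' in first:
        return 'Filename1'
    return 'none'

# Hand-written records: the keys follow the notebook, the values are invented.
samples = [
    '{"1234": {"ID": "abcdefghij"}}',
    '{"1235": {"Filename1": "H1_abcdefghij_0.5.png"}}',
    '{"1236": {"something_else": 1}}',
]

branch_counts = Counter(subject_id_source(s) for s in samples)
print(branch_counts)
```

On the real data, `classification_data['subject_data'].map(subject_id_source).value_counts()` would show how many IDs come from the filename slice, which is the branch most worth spot-checking.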

In [None]:
def user_choice_from_annotations(annotations):
    # given a string of JSON representing the annotation, extract and return the user's annotation
    a = json.loads(annotations)

    # not all users actually recorded a choice
    try:
        v = a[0]['value'][0]['choice']
    except (IndexError, KeyError): 
        # volunteer apparently didn't pick anything 
        v = ''
    
    return v

In [None]:
# example
user_choice_from_annotations(classification_data['annotations'][2])

## Read data downloaded from Zooniverse

In [None]:
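# Column 7 is parsed as a datetime. infer_datetime_format is deprecated since pandas 2.0
# (format inference is now the default behaviour), so it is omitted here.
classification_data = pd.read_csv('160626 gravity-spy-classifications.csv', parse_dates=[7])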

In [None]:
len(classification_data)

## Add new columns to the data frame with the extracted data

In [None]:
classification_data['subject_ID'] = classification_data['subject_data'].map(ID_from_subject_data)

In [None]:
classification_data['annotation'] = classification_data['annotations'].map(user_choice_from_annotations)

The uncaught cases, i.e. the rows for which no subject ID could be extracted:

In [None]:
classification_data[classification_data.subject_ID == '']

## A few sample statistics...

In [None]:
classification_by_user = classification_data.groupby('user_name').count()

In [None]:
# median number of classifications per user (counted as non-null user_ip values)
classification_by_user['user_ip'].median()

In [None]:
classification_data.groupby('annotation').count()['classification_id']
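A caveat on the `count()` calls above: `count()` tallies non-null values per column, so the per-user figure depends on which column is read afterwards. The sketch below uses an invented toy frame (the column names follow the notebook, the values do not come from the export) to contrast it with `size()`, which counts rows.

```python
import pandas as pd

# Toy stand-in for the export; column names follow the notebook, values are invented.
toy = pd.DataFrame({
    'classification_id': [1, 2, 3, 4, 5],
    'user_name': ['ann', 'ann', 'bob', 'bob', 'bob'],
    'user_ip': ['a1', None, 'b1', 'b1', 'b1'],
})

# count() tallies non-null values per column, so a missing user_ip lowers the tally.
by_count = toy.groupby('user_name').count()['user_ip']

# size() tallies rows per group regardless of missing values.
by_size = toy.groupby('user_name').size()

print(by_count.to_dict())
print(by_size.to_dict())
print(by_size.median())
```

If every row in the real export has a `user_ip`, the two approaches agree; `size()` simply removes the dependence on that assumption.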

## Glitch classes
The coding for the classes differs between the ML and the human classifications, so this file translates between them. The human file also uses multiple codes for the same glitch class, so some lines repeat. I marked a preferred code in case we ever want to move from the ML data to the codes used in Gravity Spy. Note that pandas' default integer dtype can't hold an NA, which is why Model_number is a float.

In [None]:
glitch_classes = pd.read_csv('glitch-classes.csv')

In [None]:
glitch_classes
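As an aside on the Model_number note above: pandas also offers a nullable integer extension dtype, `'Int64'` (capital I), which can hold missing values. The sketch below is only an illustration. The column names follow the notebook, but the codes and numbers in `toy_classes` are invented, and whether the real file converts cleanly is an assumption.

```python
import pandas as pd

# Hypothetical rows: the column names follow the notebook, the codes are invented.
toy_classes = pd.DataFrame({
    'HCID': ['BLP', 'BLIP', 'WHS', 'NOG'],
    'MLID': [0, 0, 1, 2],
    'Preferred': [1, 0, 1, 1],
    'Model_number': [3, 3, None, 7],
})

# With a missing value, the default inference falls back to float.
print(toy_classes['Model_number'].dtype)

# The nullable extension dtype keeps integers and marks the gap as <NA>.
nullable = toy_classes['Model_number'].astype('Int64')
print(nullable.dtype)
print(nullable.isna().sum())

# Keeping only preferred rows should leave one row per ML code.
preferred = toy_classes[toy_classes.Preferred == 1]
print(preferred['MLID'].is_unique)
```

The last check is worth running on the real table too, since the merge on `MLID` in the next example only stays one-to-one if each ML code has exactly one preferred row.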

Example use: make the mean confidences more readable

In [None]:
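# ML_data is not defined in this notebook and must be loaded first.
# String aggregator names keep the result columns flat ('mean', 'size'), which avoids
# merging a frame with MultiIndex columns against the flat glitch_classes table.
mean_scores = ML_data.groupby('label')['confidence'].agg(['mean', 'size'])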
mean_scores['MLID'] = mean_scores.index
mean_scores = pd.merge(mean_scores, glitch_classes[glitch_classes.Preferred==1], on='MLID')
mean_scores

## Add ML labels to classification data
Merge the human data with the translation to the ML coding system and drop most of the columns. I chose to keep user_id instead of user_name in an attempt to make the data more private. It occurs to me that we should filter out internal users who were debugging or doing demos; I was assured that they all classified seriously, but their learning parameters are probably still different. There are 51224 classifications in total, of which 27 have no subject_ID because of the problem mentioned above.

In [None]:
human_data_cols = ['user_id', 'subject_ID', 'created_at', 'annotation', 'workflow_id', 'workflow_version']
human_data = pd.merge(classification_data.loc[classification_data.annotation != '', human_data_cols],
                      glitch_classes, left_on='annotation', right_on='HCID')
human_data.drop(['annotation','HCID','Preferred'], inplace=True, axis=1)

In [None]:
human_data.head()

In [None]:
len(human_data)
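Because `pd.merge` defaults to an inner join, any annotation whose code is missing from the translation table is dropped without a message. A possible check, sketched here on invented toy frames (the column names follow the notebook, the codes are hypothetical), is to repeat the merge as a left join with `indicator=True` and look at the rows that found no partner. `validate='many_to_one'` additionally raises if a human code appears more than once in the table, which would otherwise duplicate classifications.

```python
import pandas as pd

# Toy stand-ins: column names follow the notebook, the codes are invented.
toy_human = pd.DataFrame({
    'user_id': [1, 2, 3],
    'subject_ID': ['s1', 's2', 's3'],
    'annotation': ['BLP', 'BLIP', 'XYZ'],
})
toy_classes = pd.DataFrame({
    'HCID': ['BLP', 'BLIP', 'WHS'],
    'MLID': [0, 0, 1],
    'Preferred': [1, 0, 1],
})

# A left join keeps every classification; the _merge column records whether a
# translation was found. validate checks that each HCID occurs only once.
checked = pd.merge(toy_human, toy_classes, left_on='annotation', right_on='HCID',
                   how='left', indicator=True, validate='many_to_one')

unmatched = checked[checked['_merge'] == 'left_only']
print(unmatched['annotation'].tolist())
```

Comparing `len(human_data)` with the number of non-empty annotations gives the same information in aggregate; the indicator column shows which codes are responsible.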

## Save data for reuse

In [None]:
classifications_store = pd.HDFStore('160626 data.h5')
# classifications_store['classification_data'] = classification_data
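# del raises KeyError when the key is missing (e.g. on a first run), so only remove an existing copy
if 'human_data' in classifications_store:
    del classifications_store['human_data']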
classifications_store['human_data'] = human_data
classifications_store.close()