<div style="font-size: 2rem;">Take `classify-classifications.csv` and parse it into several dataframes for analysis. Doing this here keeps the analysis notebooks at a manageable length.</div>

In [77]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc
from dateutil.parser import parse
from datetime import date

import sys 
sys.path.append('..')
from sf import getFilename, parseTime, extract_task_value, percentageVotesForAnswer

%matplotlib inline

- ## Parse the dataframe itself:

In [6]:
# load the unique object names that were produced by another notebook
candidate_names_classify = np.loadtxt('../sf_objectImageStrings__classification-classify.txt', dtype=str)

In [7]:
# load dataframe
df = pd.read_csv('../../SpaceFluff/zooniverse_exports/classify-classifications.csv', delimiter=",")

# JSON parse the columns that were stringified
columns_to_parse = ['annotations', 'subject_data', 'metadata']

for column in columns_to_parse:
    df[column] = df[column].apply(json.loads)
    
# extract filename, task0 and task1 values to new dataframe columns
df['Filename'] = df['subject_data'].apply(getFilename)
df['Task0'] = df['annotations'].apply(lambda x: extract_task_value(0, x))
df['Task1'] = df['annotations'].apply(lambda x: extract_task_value(1, x))

# remove all rows where task0 wasn't answered, since such rows are useless
df = df[~df['Task0'].isnull()]

# filter out classifications from beta
df['created_at'] = df['created_at'].apply(parseTime)
end_of_beta = pd.Timestamp(date(2020,10,20), tz='utc')
df = df[df['created_at'] > end_of_beta]

# create temporary isRetired and alreadySeen columns
df['isRetired'] = df['metadata'].apply(lambda x: x['subject_selection_state']['retired'])
df['alreadySeen'] = df['metadata'].apply(lambda x: x['subject_selection_state']['already_seen'])

# remove rows where isRetired or alreadySeen
df = df[~df['isRetired'] & ~df['alreadySeen']]

# remove isRetired and alreadySeen columns since they're obsolete hereafter
df.drop(['isRetired', 'alreadySeen'], axis=1, inplace=True)

---
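
The helpers imported from `sf` are not shown in this notebook. As a rough illustration of the parsing step above, here is a minimal sketch on a toy export. It assumes each annotation is a dict with `task`, `task_label` and `value` keys, and `task_value_sketch` is a hypothetical stand-in for `extract_task_value`, not its actual implementation.

```python
import json
import pandas as pd

# Toy export rows; the annotation layout (task / task_label / value) is an assumption.
raw = pd.DataFrame({
    "annotations": [
        '[{"task": "T0", "task_label": "What is this?", "value": "Galaxy"},'
        ' {"task": "T1", "task_label": "Fluffy or bright?", "value": "Fluffy"}]',
        '[{"task": "T0", "task_label": "What is this?", "value": "Something else/empty center"}]',
        '[]',
    ],
})


def task_value_sketch(task_index, annotations):
    """Hypothetical stand-in for sf.extract_task_value.

    Returns the value of the annotation at position task_index, or None if
    that task was not answered.
    """
    if task_index < len(annotations):
        return annotations[task_index].get("value")
    return None


# JSON-parse the stringified column, then pull out the two task answers
raw["annotations"] = raw["annotations"].apply(json.loads)
raw["Task0"] = raw["annotations"].apply(lambda a: task_value_sketch(0, a))
raw["Task1"] = raw["annotations"].apply(lambda a: task_value_sketch(1, a))

# drop rows where task 0 was not answered
answered = raw[raw["Task0"].notnull()]
print(answered[["Task0", "Task1"]])
```

Rows without a task 1 answer keep a missing value in `Task1`, which is why the 'task1' section further down counts missing answers separately.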

In [8]:
df[df.user_name.isnull()]  # returns empty df, so every classification has a user_name associated with it
df[df.user_ip.isnull()]  # also returns empty df

unique_entries = {
    "user_name": df.user_name.unique().shape[0],
    "user_ip": df.user_ip.unique().shape[0],
    "user_id": df.user_id.unique().shape[0]
}

unique_entries

{'user_name': 1783, 'user_ip': 1157, 'user_id': 1136}

The output above shows that there are more unique user names than unique IPs or user ids. Several people may share an IP, and not every classification has an associated user_id.
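
One caveat when reading these counts: `Series.unique()` keeps a missing value as an entry of its own, whereas `Series.nunique()` ignores missing values by default. The sketch below shows the difference on a toy frame; the user names and ids in it are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: two classifications lack a user_id (values are made up)
toy = pd.DataFrame({
    "user_name": ["alice", "bob", "anonymous-1", "anonymous-2"],
    "user_id": [1.0, 2.0, np.nan, np.nan],
})

with_nan = toy["user_id"].unique().shape[0]   # NaN is counted once as its own entry
without_nan = toy["user_id"].nunique()        # NaN is ignored by default
n_missing = toy["user_id"].isnull().sum()     # number of rows without a user_id

print(with_nan, without_nan, n_missing)
```

Checking `df.user_id.isnull().sum()` on the real data would show how many classifications lack a user id.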

---

- ## Create 'task0' dataframe:

In [9]:
# group df by filename, so that each group contains only rows belonging to that object
gr = df.groupby('Filename')

# create empty list to push results to
task0Values = []

# loop over every group created above to accumulate 'task 0' votes ('galaxy'/'group of objects'/'something else')
for objectName in candidate_names_classify:
    try:
        task0 = gr.get_group(objectName)['Task0']

        counts = task0.value_counts().to_dict()

        countObj = {
            "name": objectName,
            "counts": counts,
        }

        task0Values.append(countObj)
    except KeyError:
        # no classifications left for this object after filtering
        continue
        
df_task0 = pd.DataFrame(task0Values)

answer_types = ['Galaxy', 'Group of objects (Cluster)', 'Something else/empty center']

df_task0['# votes'] = df_task0['counts'].apply(lambda x: sum(x.values()))

for ans_type in answer_types:
    vote_percentage_column = df_task0['counts'].apply(lambda x: percentageVotesForAnswer(x, ans_type))
    df_task0['% votes {}'.format(ans_type)] = vote_percentage_column
    
df_task0['name'] = df_task0['name'].apply(lambda x: x[:-9])

# filter dataframe and only leave objects with more than 5 votes
df_task0 = df_task0[df_task0['# votes'] > 5]
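
As an aside, the same tally can be computed without an explicit loop. The following is a sketch of an alternative on toy data, not the approach used in this notebook: `pd.crosstab` counts votes per object and answer, and dividing by the row sums gives percentages. Rounding to one decimal is an assumption here, since the rounding in `percentageVotesForAnswer` is not shown.

```python
import pandas as pd

# Toy classifications: object "a" has 3 votes, object "b" has 2
toy = pd.DataFrame({
    "Filename": ["a", "a", "a", "b", "b"],
    "Task0": [
        "Galaxy",
        "Galaxy",
        "Group of objects (Cluster)",
        "Galaxy",
        "Something else/empty center",
    ],
})

# rows: objects, columns: answers, cells: number of votes
counts = pd.crosstab(toy["Filename"], toy["Task0"])

# total votes per object, and each answer's share as a percentage
n_votes = counts.sum(axis=1)
percentages = counts.div(n_votes, axis=0).mul(100).round(1)

print(percentages)
```

Answers that an object never received show up as 0 rather than as missing keys, which removes the need for a per-answer lookup.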

- ## Create 'df_retired' and 'df_with_props' dataframes

In [11]:
def extract_retired_info(subject_data):
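    '''
        @param subject_data: one entry of the dataframe's 'subject_data' column
                             (a dict whose first value holds the subject's fields)
        @return: the 'retired' field of that subject
    '''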
    return list(subject_data.values())[0]["retired"]

In [12]:
df["retired"] = df["subject_data"].apply(extract_retired_info)
df_retired = df[~df["retired"].isnull()]

gr_retired = df_retired.groupby("Filename")    # group by filename (scalar key, so get_group accepts a plain name)
props = ["R", "RA", "DEC", "G-I"]              # object properties to extract

props_list = []

for objectName in candidate_names_classify:
    # get group
    try:
        row = gr_retired.get_group(objectName)['subject_data']

        # get first entry in the group (props should be the same for every entry since they all describe the same object)
        firstEntry = row.iloc[0]
        values  = list(firstEntry.values())[0]

        # create object with name, properties
        entry = {'name': objectName[:-9]}

        for key in props:
            entry[key] = values[key]

        props_list.append(entry)
    except (KeyError, IndexError):
        # object has no retired classifications, or a property is missing
        continue
        
df_props = pd.DataFrame(props_list)

df_with_props = df_task0.merge(df_props, on='name', how='outer')

- ## Create 'task1' dataframe, and merge it with 'task0' dataframe to get 'tasks' dataframe

In [75]:
# create a temporary dataframe containing only classifications where 'task0' == 'Galaxy'
# df_galaxy = df[(df['Task0'] == 'Galaxy') & (df['annotations'].map(len) > 1)]
df_galaxy = df[(df['Task0'] == 'Galaxy')]
galaxy_names = df_galaxy['Filename']

gr_by_name = df.groupby('Filename')

galaxy_task1_values = []

for name in set(galaxy_names):
    group = gr_by_name.get_group(name)        # get all classifications of this object from df
    group = group[group['Task0'] == 'Galaxy'] # select only rows where task0 was answered with 'galaxy'
    
    rowObj = {}
    
    # add '% Fluffy' and '% Bright' entries
    for answer in ['Fluffy', 'Bright']:
        rowObj['% {}'.format(answer)] = round(list(group['Task1']).count(answer)*100/group.shape[0], 1)
    
    # also manually add a '% None' entry, since None is parsed to NaN otherwise
    none_count = group[group['Task1'].isnull()].shape[0]
    rowObj['% None'] = round(none_count*100/group.shape[0], 1)
    rowObj['name'] = name[:-9]  # add object's name to rowObj
    
    galaxy_task1_values.append(rowObj)  # append rowObj to list

df_task1 = pd.DataFrame(galaxy_task1_values)

df_tasks = df_task1.merge(df_task0, on='name', how='outer')

object_info = pd.read_csv('../../catalogue/sf_spacefluff_object_data.csv', comment="#")

# merge properties onto dataframe
df_tasks_with_props = df_tasks.merge(object_info, how='outer', on='name')

# filter out objects without actual votes
df_tasks_with_props = df_tasks_with_props[~df_tasks_with_props['# votes'].isnull()]
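
To see what the outer merge and the subsequent filter do, here is a small sketch on invented data. It uses `merge(..., indicator=True)`, which adds a `_merge` column that records whether each row came from the left frame, the right frame, or both. The object names and values are made up.

```python
import pandas as pd

# Toy stand-ins for df_tasks and the object catalogue (values are invented)
votes = pd.DataFrame({"name": ["obj1", "obj2"], "# votes": [12, 7]})
catalogue = pd.DataFrame({"name": ["obj2", "obj3"], "RA": [54.1, 55.3]})

# outer merge keeps every object from either side; _merge records the origin
merged = votes.merge(catalogue, on="name", how="outer", indicator=True)
print(merged[["name", "_merge"]])

# same filter as above: keep only objects that actually have votes
with_votes = merged[merged["# votes"].notnull()]
print(with_votes["name"].tolist())
```

Here `obj3` exists only in the catalogue and is dropped by the filter, while `obj1` is kept with missing catalogue properties.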

In [51]:
# df_tasks[df_tasks['% None'].notnull() & (df_tasks['% None'] > 0)]

---

We end up with the following dataframes:
- `df`: 
        parsed version of the complete data set
        
- `df_galaxy`: 
        filtered version of `df` containing only classifications where task0 == 'Galaxy'
        
- `df_retired`, `df_props`: 
        temporary dataframes, both used to create `df_with_props`
        
- `df_task0`:
        contains the name of each object, its total number of votes, and the percentage of votes for each task0 option; only objects with more than 5 votes are kept.
        
- `df_with_props`:
        version of `df_task0` with the properties of each object (R, RA, DEC, G-I) merged onto it
        
- `df_task1`:
        contains the name of each object that received at least one 'Galaxy' vote, and the percentage of those votes that answered 'Fluffy', 'Bright', or nothing for task1, the follow-up question asked when people answered 'Galaxy' for task0.
        
- `df_tasks`:
        `df_task1` outer joined with `df_task0` on `name`
        
- `df_tasks_with_props`:
        df_tasks merged with object info (from Venhola's catalogue)

- ### Export df_tasks_with_props and df

In [78]:
df_tasks_with_props.to_csv('./tasks_with_props.csv')

In [79]:
df.to_csv('./df.csv')
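
A note for whoever loads `df.csv` later: the JSON-parsed columns hold Python objects, and `to_csv` writes them using their Python string form, not JSON. The sketch below, on a toy frame with an in-memory buffer, shows one way to restore such a column with `ast.literal_eval`. Whether the downstream notebooks do this is not shown here.

```python
import ast
import io
import pandas as pd

# Toy frame with a dict-valued column, like the parsed 'metadata' column
toy = pd.DataFrame({"metadata": [{"retired": False}, {"retired": True}]})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# the column comes back as plain strings such as "{'retired': False}"
restored = pd.read_csv(buffer)
as_written = restored["metadata"].tolist()

# literal_eval turns the Python-style string back into a dict
restored["metadata"] = restored["metadata"].apply(ast.literal_eval)
print(restored["metadata"][0]["retired"])
```

`json.loads` would reject these strings because of the single quotes and the capitalised `False`. Also note that `to_csv` without `index=False`, as in the cells above, writes the dataframe index as an extra unnamed first column.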