## Goal
Take `classify_classifications.csv` and parse it into various dataframes for analysis. Do this in this notebook to keep the length of our analysis notebooks manageable.

@note: I'd rather write functions that build and return these dataframes directly, because extracting JSON columns, saving them to CSV, and parsing them again on load gets tricky.
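
To illustrate why the CSV round trip is awkward, here is a minimal sketch on a made-up two-row frame. The column names `classification_id` and `annotations` and their contents are assumptions for illustration, not the actual layout of `classify_classifications.csv`.

```python
import ast
import io
import json

import pandas as pd

# Hypothetical raw data: 'annotations' arrives as a JSON string per row.
raw = pd.DataFrame({
    "classification_id": [1, 2],
    "annotations": [
        '[{"task": "T0", "value": "Galaxy"}]',
        '[{"task": "T0", "value": "None"}]',
    ],
})

# Parse the JSON strings into Python objects (lists of dicts).
parsed = raw.assign(annotations=raw["annotations"].apply(json.loads))

# Round trip through CSV, using an in-memory buffer instead of a file.
buffer = io.StringIO()
parsed.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The parsed objects were written via str(), so they come back as strings
# in Python repr form (single quotes), which json.loads rejects.
try:
    json.loads(reloaded.loc[0, "annotations"])
    json_roundtrip_ok = True
except json.JSONDecodeError:
    json_roundtrip_ok = False

# ast.literal_eval can recover the objects from their repr form.
restored = reloaded.assign(
    annotations=reloaded["annotations"].apply(ast.literal_eval)
)
```

Serializing such columns with `json.dumps` before saving would be another option, but every reader of the CSV then has to remember to parse them again, which is the bookkeeping that returning dataframes from functions avoids.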

In [2]:
import json
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc
from dateutil.parser import parse
from datetime import date

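import sys
sys.path.append('../..')
from sf_lib.sf import getFilename, parseTime, extract_task_value, percentageVotesForAnswer
# make_df_with_props is called further down; it is assumed to live in sf_lib.df with the other helpers
from sf_lib.df import make_df_classify, make_df_task0, make_df_with_props, make_df_tasks_with_props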

%matplotlib notebook

In [3]:
# extract unique object names that resulted from another notebook 
candidate_names_classify = np.loadtxt('../sf_candidate_names__classification-classify.txt', dtype=str)

- ## Parse the dataframe itself:

In [6]:
df = make_df_classify()

In [7]:
df[df.user_name.isnull()]  # returns empty df, so every classification has a user_name associated with it
df[df.user_ip.isnull()]  # also returns empty df

unique_entries = {
    "user_name": df.user_name.unique().shape[0],
    "user_ip": df.user_ip.unique().shape[0],
    "user_id": df.user_id.unique().shape[0]
}

unique_entries

{'user_name': 1783, 'user_ip': 1157, 'user_id': 1136}

The counts above show that there are more unique user names than unique IPs or IDs. Assume multiple people may share an IP, and note that not all classifications have an associated user_id.
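
As a hedged illustration of how missing IDs and shared IPs affect these counts, the sketch below uses a small invented frame; the user names, IPs, and IDs are hypothetical. Note that `.unique()` keeps a missing value as one entry, while `.nunique()` drops it by default.

```python
import numpy as np
import pandas as pd

# Invented example: two logged-in users and two anonymous ones sharing an IP.
toy = pd.DataFrame({
    "user_name": ["alice", "bob", "anon_1", "anon_2", "alice"],
    "user_ip": ["ip1", "ip2", "ip3", "ip3", "ip1"],
    "user_id": [10.0, 11.0, np.nan, np.nan, 10.0],
})

# .unique() includes NaN as a single entry; .nunique() excludes it by default.
n_ids_with_nan = toy.user_id.unique().shape[0]
n_ids_without_nan = toy.user_id.nunique()

# Number of classifications without a user_id.
n_missing_ids = toy.user_id.isnull().sum()

# Number of distinct user names seen per IP; values above 1 mean a shared IP.
names_per_ip = toy.groupby("user_ip").user_name.nunique()
```

On the real `df`, the same `groupby` would show which IPs are associated with more than one user name.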

---

- ## Create 'task0' dataframe:

In [7]:
df_task0 = make_df_task0(df, candidate_names_classify)

- ## Create 'df_retired' and 'df_with_props' dataframes

In [8]:
df_with_props = make_df_with_props(df, candidate_names_classify)

- ## Create 'task1' dataframe, and merge it with 'task0' dataframe to get 'tasks' dataframe

In [9]:
object_info = pd.read_csv('../../catalogue/sf_spacefluff_object_data.csv', comment="#")

df_tasks_with_props = make_df_tasks_with_props(df, candidate_names_classify, object_info)

In [10]:
# parentheses are needed because & binds more tightly than >
# df_tasks_with_props[df_tasks_with_props['% None'].notnull() & (df_tasks_with_props['% None'] > 0)]
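
A filter like the commented-out one above depends on operator precedence: in Python, `&` binds more tightly than `>`, so each comparison should be wrapped in parentheses before combining masks. A minimal sketch on an invented frame (the `name` values are hypothetical; the `% None` column name is taken from the cell above):

```python
import numpy as np
import pandas as pd

# Invented example with one zero, one positive, and one missing percentage.
toy = pd.DataFrame({
    "name": ["obj_a", "obj_b", "obj_c"],
    "% None": [0.0, 25.0, np.nan],
})

# Parenthesize the comparison so it is evaluated before the & operator.
mask = toy["% None"].notnull() & (toy["% None"] > 0)
selected = toy[mask]
```

Without the parentheses, the expression is read as `(~is_null & values) > 0`, which is not the intended mask.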

---

We end up with the following dataframes:
- `df`: 
        parsed version of the complete data set
        
- `df_galaxy`: 
        filtered version of `df` leaving classifications where task0 == 'galaxy'
        
- `df_retired`, `df_props`: 
        temporary dataframes, both used to create `df_with_props`
        
- `df_task0`:
        contains the name of each galaxy, the total number of votes, and the percentage of votes for each option from task0.
        
- `df_with_props`:
        version of df_task0 with the properties of each galaxy merged onto it
        
- `df_task1`:
        contains the name of each galaxy, and the percentage of votes for 'fluffy' and 'bright' for task1, asked as a follow-up when people answered 'Galaxy' for task0.
        
- `df_tasks`:
        df_task1 outer joined onto df_task0
        
- `df_tasks_with_props`:
        df_tasks merged with object info (from Venhola's catalogue)
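
The relationship between the task0, task1, and combined tables described above can be sketched on invented data. This is not the implementation in `sf_lib.df`; the long-format `votes` frame, the helper `vote_percentages`, and the `# votes` column name are all assumptions made for illustration.

```python
import pandas as pd

# Invented long-format votes: one row per classification.
votes = pd.DataFrame({
    "name": ["obj1", "obj1", "obj1", "obj1", "obj2", "obj2"],
    "task0": ["Galaxy", "Galaxy", "Galaxy", "None", "None", "None"],
    "task1": ["fluffy", "bright", "fluffy", None, None, None],
})

def vote_percentages(frame, column):
    """Hypothetical helper: percentage of votes per answer, one row per object."""
    counts = pd.crosstab(frame["name"], frame[column])
    pct = counts.div(counts.sum(axis=1), axis=0) * 100
    pct.columns = [f"% {answer}" for answer in pct.columns]
    return pct

# task0 table: total votes plus percentage per task0 option.
toy_task0 = vote_percentages(votes, "task0")
toy_task0.insert(0, "# votes", votes.groupby("name").size())

# task1 table: only classifications that answered 'Galaxy' for task0.
toy_task1 = vote_percentages(votes[votes["task0"] == "Galaxy"], "task1")

# Outer join keeps objects that never received a task1 answer (NaN columns).
toy_tasks = toy_task0.merge(
    toy_task1, how="outer", left_index=True, right_index=True
)
```

Here `obj2` has no 'Galaxy' votes, so its task1 percentages are missing after the outer join, which is why the null check on percentage columns matters in later filters.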