# Computing Project Stats and Cleaning Up Classification Exports

When project teams first download classification exports, they usually need to do the following:
 - Calculate some statistics for classifications, classifiers, and the project in general
 - Extract just the classifications they want from the raw export (e.g. remove duplicates, classifications of old workflows, and test classifications from before the project was live)

This notebook will show you how to do both of these things using Python scripts that exist in this repository.

In [1]:
import sys, os
import numpy as np
import pandas as pd
import ujson

print("Python version: %d.%d.%d, numpy version: %s, pandas version: %s." %(sys.version_info[0], 
                                                                           sys.version_info[1], 
                                                                           sys.version_info[2], 
                                                                           np.__version__, 
                                                                           pd.__version__))
print("Originally developed using Py 2.7.11, np v1.11.0, pd v0.19.2")
print("If these versions don't match and stuff breaks, that's probably why.")

Python version: 2.7.11, numpy version: 1.11.0, pandas version: 0.19.2.
Originally developed using Py 2.7.11, np v1.11.0, pd v0.19.2
If these versions don't match and stuff breaks, that's probably why.


The most common project statistics are computed as part of `basic_classification_processing`, which used to be `basic_project_stats` (but was renamed because it does more than that).

In [2]:
from basic_classification_processing import basic_stats_processing, basic_stats_help

As we did for the warm-up notebook "First Look at Classifications", we'll assume the project is called "My Project".

The example file `my-project-classifications.csv` has 50,000 classifications in it, which is enough to extract meaningful statistics but not enough that re-calculating will take very much time.

These are real classifications excerpted from a real project, but the `user_id` values have been changed and do not reflect real Zooniverse user IDs. The `user_name` values are real; these are public.

## Computing project stats

In [3]:
project_name = "my-project"

classification_file = project_name + "-classifications.csv"

# if you want to do this separately you need to also import read_classfile from 
# basic_classification_processing above
#classifications = read_classfile(classification_file)

# compute the most basic stats, not worrying about duplicate or non-live classifications 
# or multiple workflows
basic_stats_processing(classification_file)

Computing project stats using:
   infile: my-project-classifications.csv
Reading classifications from my-project-classifications.csv
Considering all classifications in workflow ids:
[4958 5030 4975]
 and workflow_versions:
[ 17.6   3.8   1.1]
Retaining all non-live classifications in analysis.

Overall:

50000 classifications of 7568 subjects by 919 classifiers,
742 logged in and 177 not logged in, from 949 unique IP addresses.
46393 classifications were from logged-in users, 3607 from not-logged-in users.

That's 6.61 classifications per subject on average (median = 6.0).
The most classified subject has 212 classifications; the least-classified subject has 1.

Median number of classifications per user: 18.00
Mean number of classifications per user: 54.41

Top 10 most prolific classifiers (with classification counts):
user_name
MerylPG          1028
Velski            907
Lampyrichard      879
DonnaNoble888     770
SlowLoris         658
Quinacridone      609
jrussill          581
rcmill

### Making sense of the stats output

There's a lot of information here. Let's break it down.

 1. **Workflow information**
 -- The first thing the program prints is a list of all workflow IDs with at least one classification in the file, followed by the versions of those workflows (note that it doesn't pair each version with its workflow). This tells you what the export covers. The example "My Project" file has 3 workflows, with one version of each, but this can vary by project.
 
 2. **Classification and Classifier counts, averages, medians**
 -- This is meant to give you a sense of who is classifying and how many classifications they're doing. The IP address count is meant to help you estimate how many of the not-logged-in classifiers are people who went to the project, started classifying, and logged in later (their not-logged-in classifications will have the same IP address and browser session). The difference between means and medians gives you a sense of how skewed the distributions of classifications (per classifier and per subject) are.
 
 3. **Top 10 most prolific classifiers**
 -- Leaderboards can be useful just to see who's doing what in your project, but *we do not recommend sharing them publicly* because they tend to encourage people to prioritize getting on the leaderboard, sometimes by sacrificing accuracy in favor of classifying fast. Plus, it may not be a great idea to imply that the people who classify the most are also the most valuable volunteers. However, this can be useful internally so you know who your most prolific classifiers are (they aren't always the people who post the most on Talk).
 
 4. **Gini coefficient**
 -- The Gini coefficient measures inequality in distributions of things. It was originally conceived for economics (e.g. where is the wealth in a country? in the hands of many citizens or a few?), but it's just as applicable to many other fields. In this case we'll use it to see how classifications are distributed among classifiers.
 
 G = 0 is a completely even distribution (everyone does the same number of classifications), and G ~ 1 is a completely uneven one (~all the classifications are done by one classifier). Typical values of the Gini coefficient for healthy Zooniverse projects (Cox et al. 2015) are in the range of 0.7-0.9, although this can vary by project discipline (Spiers et al. 2019). That range generally indicates a project with a loyal core group of volunteers who contribute the bulk of the classification effort, balanced by a regular influx of new classifiers trying out the project, some of whom go on to join that core group. Once your project is fairly well established, you can compare it to past Zooniverse projects to see how you're doing.
 
 If your G is well below 0.7, you may be having trouble turning new classifiers into a loyal group of volunteers: people are trying the project, but not many are staying. If your G is > 0.9, it's a little more complicated. If your total classification count is lower than you'd like, you may be having trouble recruiting classifiers to the project, so that your classification counts are dominated by a few people. But if you have G > 0.9 and plenty of classifications, this may be a sign that your loyal users are *really* committed, so a very high G is not necessarily a bad thing.

 Of course, the Gini coefficient is a single summary number and can't capture every nuance of the distribution, but it's still a useful broad metric.

 5. **Classification dates and highest classification ID**
 -- It's useful to make sure your classification dates cover the time period you think they should, and the classification ID may be useful if you need to request a data export for just the most recent classifications since a given classification ID.
 
 6. **File with user classification counts**
 -- This is always saved when you run the stats function, mainly because it has to be generated anyway in order to compute these stats, and it can come in handy for other things, **e.g. generating an author list**.
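
The per-classifier averages and the Gini coefficient described in items 2 and 4 above can be reproduced roughly as follows. This is a minimal sketch on made-up data, not the code used inside `basic_classification_processing`: the `gini` helper and the `toy` DataFrame are illustrative assumptions, and the formula is the standard rank-based expression for the Gini coefficient of a sorted sample.

```python
import numpy as np
import pandas as pd


def gini(counts):
    """Gini coefficient of a 1-D collection of non-negative counts.

    Hypothetical helper for illustration; uses the rank-based formula
    G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n  with x sorted ascending.
    """
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    return 2.0 * np.sum(ranks * x) / (n * x.sum()) - (n + 1.0) / n


# Toy stand-in for a classification export: one row per classification.
toy = pd.DataFrame({"user_name": ["a", "a", "a", "a", "b", "b", "c", "d"]})

# Number of classifications per classifier.
per_user = toy.groupby("user_name").size()

print("Mean classifications per user:   %.2f" % per_user.mean())
print("Median classifications per user: %.2f" % per_user.median())
print("Gini coefficient:                %.3f" % gini(per_user.values))
```

A perfectly even distribution such as `gini([1, 1, 1, 1])` gives 0, while a sample in which one classifier does everything approaches 1 as the number of classifiers grows.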
 
Note, however, that this is only what the function prints *by default*, and by default many of its options are turned off.

Let's see the full set of options:

In [4]:
# print help for basic_stats_processing()

basic_stats_help()


Usage: basic_stats_processing(classfile_in, workflow_id=-1, workflow_version=-1, time_elapsed=False, output_csv=False, remove_duplicates=False, keep_nonlive=True, keep_allcols=False, verbose=True)
      classifications_infile is a Zooniverse (Panoptes) classifications data export CSV.

  Optional inputs:
    workflow_id=N
       specify the program should only consider classifications from workflow id N
    workflow_version=M
       specify the program should only consider classifications from workflow version M
       (note the program will only consider the major version, i.e. the integer part)
    workflow_ver_min=Mmin, workflow_ver_max=Mmax
       specify the program should consider classifications from workflow version >= Mmin
       or <= Mmax, or Mmin <= workflow version <= Mmax, if both are specified
       Note specifying either a min or max supersedes specifying a workflow_version.
    outfile_csv=filename.csv
       if you want the program to save a sub-file with only class

### Exploring the different options

Let's say we want to run the stats function but only consider classifications from a specific workflow (say, `workflow_id = 5030`) *and* remove duplicate classifications *and* estimate the total amount of human effort the classifiers contributed during those classifications.

In [5]:
basic_stats_processing(classification_file, workflow_id=5030, remove_duplicates=True, 
                       time_elapsed=True)

Computing project stats using:
   infile: my-project-classifications.csv
Reading classifications from my-project-classifications.csv
Considering only workflow id 5030
Considering all classifications in workflow_versions:
[ 3.8]
Retaining all non-live classifications in analysis.
Found 706 duplicate classifications (3.22 percent of total).
Duplicates removed from analysis (465 unique user-subject-workflow groups).

Overall:

21191 classifications of 2629 subjects by 418 classifiers,
338 logged in and 80 not logged in, from 404 unique IP addresses.
19915 classifications were from logged-in users, 1276 from not-logged-in users.

That's 8.06 classifications per subject on average (median = 7.0).
The most classified subject has 141 classifications; the least-classified subject has 1.

Median number of classifications per user: 11.00
Mean number of classifications per user: 50.70

Top 10 most prolific classifiers (with classification counts):
user_name
Velski          876
rcmills1707     579

**Note** how the statistics have changed and are only calculated for the non-duplicated classifications from this specific workflow.

**Note 2:** when removing duplicates, the program keeps the *first* classification from each duplicated combination of `user_name + subject_id + workflow_id`.
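
The "keep the first classification per user, subject and workflow" rule can be sketched with pandas as below. This is an illustration on made-up data, not the script's actual implementation; the `toy` DataFrame and `key_cols` list are assumptions, and the subject column in a real export may be named differently (e.g. `subject_ids`), so adjust `key_cols` to match your file.

```python
import pandas as pd

# Toy stand-in for an export; rows 2 and 5 repeat an earlier
# (user_name, subject_id, workflow_id) combination.
toy = pd.DataFrame({
    "classification_id": [1, 2, 3, 4, 5],
    "user_name": ["a", "a", "b", "a", "b"],
    "subject_id": [10, 10, 10, 11, 10],
    "workflow_id": [5030, 5030, 5030, 5030, 5030],
})

# Columns that together define a duplicate (assumed names, see above).
key_cols = ["user_name", "subject_id", "workflow_id"]

# Sort so that "first" means the earliest classification.
toy = toy.sort_values("classification_id")

# Flag every repeat after the first occurrence of each combination.
is_dup = toy.duplicated(subset=key_cols, keep="first")
n_dups = int(is_dup.sum())

# Keep only the first classification of each combination.
cleaned = toy[~is_dup]

print("Found %d duplicate classifications (%.2f percent of total)."
      % (n_dups, 100.0 * n_dups / len(toy)))
```

Sorting before flagging duplicates matters: without it, "first" would simply mean whichever row happens to come first in the file.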

## Cleaning up the classifications export

Since we have to clean up the classifications in order to compute the stats anyway, why not save the cleaned file? Most data reduction codes require classifications that all have the same structure, i.e., that all come from the same `workflow_id` and `workflow_version`. In addition, it's usually best to ignore duplicate classifications. Rather than doing all this work again when we need to reduce (i.e., aggregate) the data later, we may as well write out the cleaned file now.

By default, the program only reads in the columns it needs for the stats, which saves a *lot* of memory when dealing with large files (e.g. millions of classifications). We need to turn off that option in order to produce a file that actually contains the annotations.

In [6]:
# define a new outfile name with some information that will help us later when aggregating the data
outfile_cleanclass = project_name + "-classifications-workflow5030-version3.8-nodups.csv"

# since we've already printed the stats to the screen above, don't re-print them
basic_stats_processing(classification_file, workflow_id=5030, workflow_version=3.8, 
                       remove_duplicates=True, outfile_csv=outfile_cleanclass, 
                       keep_allcols=True, verbose=False)

Reading classifications from my-project-classifications.csv
File with used subset of classification info written to my-project-classifications-workflow5030-version3.8-nodups.csv .
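
The memory saving from reading only the columns needed for the stats can be illustrated with the `usecols` argument of `pandas.read_csv`. Whether the script uses exactly this mechanism is an assumption; the tiny in-memory CSV and the `stats_cols` list below are made up for the example.

```python
import io
import pandas as pd

# A tiny made-up export with a bulky JSON "annotations" column.
csv_text = (
    "classification_id,user_name,workflow_id,annotations\n"
    '1,a,5030,"[{""task"":""T0"",""value"":""yes""}]"\n'
    '2,b,5030,"[{""task"":""T0"",""value"":""no""}]"\n'
)

# Columns assumed to be sufficient for computing basic stats.
stats_cols = ["classification_id", "user_name", "workflow_id"]

# Slim read: the annotations column is never loaded into memory.
slim = pd.read_csv(io.StringIO(csv_text), usecols=stats_cols)

# Full read: needed if the cleaned file must contain the annotations.
full = pd.read_csv(io.StringIO(csv_text))

print(list(slim.columns))
print(list(full.columns))
```

This is why `keep_allcols=True` is needed in the cell above: a file written from the slim table would contain no annotations to aggregate later.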


### Notes on using the `basic_classification_processing` scripts

`basic_classification_processing.py` is written so that you can do a few things independently of computing *all* the stats, but you can also use the `basic_stats_processing()` function to do everything in one call. You can also run it from the command line: try

 `%> python basic_classification_processing.py` 

to see the syntax in that case. You can run this script on the latest classification export each time you are about to aggregate your data, in order to produce a clean export with no duplicates (and only live classifications, if you wish), to split a full project export into separate files by workflow, or to extract a specific workflow version from a workflow-specific export. In each case, if you keep `verbose` on, you'll get the stats for free.
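
For readers who want to see what splitting by workflow and selecting a major workflow version amount to, here is a minimal pandas sketch on made-up data. It is not the script's own code: the `toy` DataFrame and the `split_by_workflow` helper are hypothetical, and the commented-out file name pattern is only a suggestion.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a full project export covering three workflows.
toy = pd.DataFrame({
    "classification_id": [1, 2, 3, 4],
    "workflow_id": [4958, 5030, 5030, 4975],
    "workflow_version": [17.6, 3.8, 3.8, 1.1],
})


def split_by_workflow(df):
    """Return a dict mapping each workflow_id to its own sub-table (hypothetical helper)."""
    return {wid: grp for wid, grp in df.groupby("workflow_id")}


subsets = split_by_workflow(toy)

# Each subset could then be written to its own file, for example:
# for wid, grp in subsets.items():
#     grp.to_csv("my-project-classifications-workflow%d.csv" % wid, index=False)

# Select one workflow and one major version (the integer part of the version).
major = np.floor(toy["workflow_version"]).astype(int)
v3 = toy[(toy["workflow_id"] == 5030) & (major == 3)]

print(sorted(subsets))
print(len(v3))
```

Comparing on the integer part mirrors the help text's note that only the major version is considered when `workflow_version` is specified.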