# RateMyPdf: Expert rankings analysis

This notebook walks you through how we compare our complexity score against expert rankings of court forms.

Before you run this notebook, install FormFyxer (see [this package's README.md](../README.md) for full instructions), and then run `pip install -r analysis_requirements.txt` so that all the necessary dependencies are installed.

In [125]:
import os
import requests
import shutil
import pandas as pd
import numpy as np
import pingouin as pg
from sklearn.metrics import cohen_kappa_score

from formfyxer import lit_explorer, pdf_wrangling

First, load the reviewer data. The Excel spreadsheet has more columns, but the important ones for our analysis are:
* the name of the reviewer (anonymized as Reviewer 1, 2, etc.)
* the full URL of the form. We only use the file name at the end of the URL to identify each form
* the reviewer's rating of how complex the form is. The full text of the prompt is shown below, but we'll call this "ratings" or "rating_column"
* the reviewer's rating of how good the form is. The full text is also shown below, but we'll call this "goodness" or "good_column".
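
As a small illustration of the file-name step, here is a sketch of how a form URL can be reduced to its final path component. The helper name `form_file_name` is hypothetical, and the second URL (with stray whitespace) is an assumed example of messy spreadsheet input, not a row from the real data.

```python
def form_file_name(url: str) -> str:
    """Return the last path component of a form URL, without surrounding whitespace."""
    return url.strip().split("/")[-1]


# The same form entered cleanly and with stray whitespace (illustrative values)
urls = [
    "https://courtformsonline.org/forms/6e420f1b3575cfd8ef94b71977da9e38252e3395a78439709c760de4.pdf",
    " https://courtformsonline.org/forms/6e420f1b3575cfd8ef94b71977da9e38252e3395a78439709c760de4.pdf\n",
]
names = [form_file_name(u) for u in urls]
print(names)
```

Both entries reduce to the same file name, which is what lets reviews of the same form be grouped together.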

In [117]:
all_expert = pd.read_excel("RateMyPDF Individual Form Expert Benchmarks(1-176).xlsx")
name_column = "Full name"
good_column="good_column"
rating_column = 'ratings'
all_expert = all_expert.rename(columns={
  "What form are you scoring? Please include the full URL, like: https://courtformsonline.org/forms/6e420f1b3575cfd8ef94b71977da9e38252e3395a78439709c760de4.pdf\n": "form_url",
  "From 1-5, with 1 being the easiest and 5 being the hardest, how complex do you think this form is?\n ": rating_column,
  "From 1-5 stars, with 5 being the best, how good a form do you think this is? Use any criteria that make sense to you.\n": good_column,
})
form_column = 'form_name'
all_expert[form_column] = all_expert['form_url'].apply(lambda y: y.split("/")[-1].strip())

Now that we have the data, let's look at some simple stats, like the count, mean, and standard deviation of the scores that each reviewer gave. Each reviewer was assigned between 20 and 35 forms to review, and each form got between 3 and 6 reviews.

In [118]:
all_expert[[name_column, rating_column, good_column]].groupby(name_column).agg(['count', 'mean', 'std'])

| Full name | ratings count | ratings mean | ratings std | good_column count | good_column mean | good_column std |
|---|---|---|---|---|---|---|
| Reviewer 1 | 30 | 3.2 | 0.484234 | 30 | 2.066667 | 0.253708 |
| Reviewer 2 | 30 | 2.333333 | 1.061337 | 29 | 3.517241 | 0.737791 |
| Reviewer 3 | 35 | 1.742857 | 1.010034 | 35 | 4.257143 | 1.09391 |
| Reviewer 4 | 25 | 2.68 | 1.10755 | 25 | 3.2 | 1.0 |
| Reviewer 5 | 35 | 1.914286 | 0.981338 | 35 | 3.342857 | 0.591253 |
| Reviewer 6 | 20 | 1.85 | 1.136708 | 20 | 3.2 | 1.281447 |


Next, we normalize the rating and goodness scores for each reviewer. Say that one reviewer judges everything to be easy and only gives 1s and 2s, even though the complexity scale runs from 1 to 5, while another reviewer gives only 4s and 5s. They might agree that certain forms are more complex than others, but their raw scores won't show it.

As we can see from the table above, each reviewer had a very different mean in both the ratings and the goodness columns. By normalizing the ratings and goodness per reviewer (subtracting that reviewer's mean and dividing by their standard deviation), we can more accurately compare the reviewers' scores to each other.

In [138]:
df = all_expert[[form_column, rating_column, name_column, good_column]]
def normalize_df(df, *, group_by, to_normalize, output_col):
  mean_col = f"{to_normalize}_mean"
  stddev_col = f"{to_normalize}_stddev"
  mean_and_stddev = (
      df.groupby(group_by)[to_normalize]
      .agg(["mean", "std"])
      .rename(columns={"mean": mean_col, "std": stddev_col})
      .reset_index()
  )
  df = pd.merge(df, mean_and_stddev, on=group_by)
  df[output_col] = (df[to_normalize] - df[mean_col]) / df[stddev_col]
  return df.drop([mean_col, stddev_col], axis='columns')

df = normalize_df(df, group_by=name_column, to_normalize=rating_column, output_col="z_rating")
df = normalize_df(df, group_by=name_column, to_normalize=good_column, output_col="z_goodness")

# Since the original ratings had 5 categories, bin the z-scores back into 5 bins.
z_score_bins = [-1.5, -0.5, 0.5, 1.5]
# z_score_bins = [-2, -1, 0, 1, 2, 3]
# np.digitize returns 0 when val < -1.5, so add 1 to everything to shift back to 1 through 5
df["z_rating_binned"] = np.digitize(df["z_rating"], bins=z_score_bins) + 1 
df["z_goodness_binned"] = np.digitize(df["z_goodness"], bins=z_score_bins) + 1
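
To see what the normalization and binning do, here is a minimal sketch on made-up data: two hypothetical reviewers ("easy" and "hard") rate the same three forms on different parts of the 1 to 5 scale. The column names and values are assumptions for illustration only; the bin edges are the same ones used in the cell above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: same ordering of forms, different parts of the scale
toy = pd.DataFrame({
    "reviewer": ["easy"] * 3 + ["hard"] * 3,
    "form": ["a.pdf", "b.pdf", "c.pdf"] * 2,
    "rating": [1, 1, 2, 4, 4, 5],
})

# Per-reviewer z-score: subtract the reviewer's mean, divide by their sample std (ddof=1)
grouped = toy.groupby("reviewer")["rating"]
toy["z"] = (toy["rating"] - grouped.transform("mean")) / grouped.transform("std")

# Bin with the same edges as above; digitize returns 0..4, so add 1 to get 1..5
toy["z_binned"] = np.digitize(toy["z"], bins=[-1.5, -0.5, 0.5, 1.5]) + 1
print(toy)
```

Although the raw ratings never overlap, both reviewers end up with the same binned scores (2, 2, 4) once each is measured against their own mean and spread.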

The first statement that we want to test is that "the expert reviewers' ratings correlate with each other". Agreement among the experts would show that looking at multiple expert rankings is useful. To test this, we use the [intraclass correlation coefficient](https://rowannicholls.github.io/python/statistics/agreement/intraclass_correlation.html) (ICC); it "assesses the reliability of ratings by comparing the variability of different ratings of the same subject to the total variation across all ratings and all subjects" (from the [pingouin documentation](https://pingouin-stats.org/build/html/generated/pingouin.intraclass_corr.html)). Since each form was reviewed by a different subset of the reviewers, we'll look at ICC1, or "single random raters".

In [139]:
expert_ratings_results = pg.intraclass_corr(data=df, targets=form_column, raters=name_column, ratings="z_rating_binned", nan_policy='omit')
print(expert_ratings_results.set_index('Type').loc[["ICC1"], ['ICC', 'pval']].round(4))

         ICC    pval
Type                
ICC1  0.2675  0.0301
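
For intuition about what the ICC1 value above measures, the sketch below computes it by hand from the standard one-way ANOVA definition, ICC1 = (MSB - MSW) / (MSB + (k - 1) * MSW), where MSB is the between-form mean square, MSW is the within-form mean square, and k is the number of ratings per form. This is not pingouin's implementation: the helper `icc1` and the toy matrices are made up for illustration, and the sketch assumes every form has the same number of ratings.

```python
import numpy as np


def icc1(ratings: np.ndarray) -> float:
    """ICC(1,1) for a balanced (forms x ratings) matrix, one-way random effects."""
    n, k = ratings.shape
    grand_mean = ratings.mean()
    form_means = ratings.mean(axis=1)
    # Between-form mean square: how much the per-form averages differ
    msb = k * np.sum((form_means - grand_mean) ** 2) / (n - 1)
    # Within-form mean square: how much ratings of the same form disagree
    msw = np.sum((ratings - form_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)


# Hypothetical ratings: 3 forms, 2 ratings each
perfect = np.array([[1, 1], [3, 3], [5, 5]], dtype=float)
close = np.array([[1, 2], [3, 3], [5, 4]], dtype=float)
print(icc1(perfect), icc1(close))
```

When the ratings of each form agree exactly, MSW is 0 and ICC1 is 1; the more the ratings of the same form disagree relative to the differences between forms, the lower the value.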


One thing we considered when creating our complexity score was what we should be measuring. We decided to measure the complexity of a form specifically, instead of whether or not a form was objectively "good". We asked each reviewer both how complex and how good they thought each form was, so we can also look at the ICC1 of the goodness ratings. It does suggest that experts may agree more on complexity than on goodness (an ICC of 0.1554 for goodness versus 0.2675 for complexity). However, the p-value for goodness (0.1103) is too high to draw any conclusions from the data.

In [140]:
expert_goodness_results = pg.intraclass_corr(data=df, targets=form_column, raters=name_column, ratings="z_goodness_binned", nan_policy='omit')
print(expert_goodness_results.set_index('Type').loc[["ICC1"], ['ICC', 'pval']].round(4))

         ICC    pval
Type                
ICC1  0.1554  0.1103


In [126]:
# Create the working directories (and any missing parents) if they don't already exist
os.makedirs("/tmp/ratemypdf_analysis/labeled", exist_ok=True)

# Average the numeric columns per form; numeric_only=True skips the reviewer-name column
mean_per_form = (
    df.groupby("form_name")
    .mean(numeric_only=True)
    .sort_values(by="z_rating", ascending=True)
    .reset_index()
)

def calc_score(fname):
    """Calculates our complexity score for the same forms that the experts rated.
    If a form has not already been downloaded and analyzed, it is downloaded and labeled first."""
    local_filename = "/tmp/ratemypdf_analysis/labeled/" + fname
    if not os.path.exists(local_filename):
        full_url = "https://courtformsonline.org/forms/" + fname
        download_loc = "/tmp/ratemypdf_analysis/" + fname
        with requests.get(full_url, stream=True) as r:
            with open(download_loc, 'wb') as f:
                shutil.copyfileobj(r.raw, f)
        all_fields = [f for f_in_page in pdf_wrangling.get_existing_pdf_fields(download_loc) for f in f_in_page]
        if not all_fields:
            pdf_wrangling.auto_add_fields(download_loc, local_filename)
        else:
            pdf_wrangling.auto_rename_fields(download_loc, local_filename)
    stats = lit_explorer.parse_form(local_filename)
    return lit_explorer.form_complexity(stats)


mean_per_form["complexity_score"] = mean_per_form[form_column].apply(calc_score)
mean = np.mean(mean_per_form["complexity_score"])
stddev = np.std(mean_per_form["complexity_score"])
mean_per_form["z_complexity_score"] = (mean_per_form["complexity_score"] - mean) / stddev
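
One detail worth keeping in mind when comparing z-scores: `np.std` defaults to the population standard deviation (`ddof=0`), while pandas' `std` (used by `normalize_df` through `agg(["mean", "std"])`) defaults to the sample standard deviation (`ddof=1`). The sketch below shows the difference on made-up scores; the numbers are hypothetical and are not real complexity scores.

```python
import numpy as np
import pandas as pd

scores = pd.Series([10.0, 12.0, 14.0, 20.0])  # hypothetical complexity scores

population_std = np.std(scores.to_numpy())  # ddof=0: divide by n
sample_std = scores.std()                   # ddof=1: divide by n - 1

print(round(population_std, 4), round(sample_std, 4))
```

The sample standard deviation is always a bit larger, so the resulting z-scores are slightly smaller in magnitude; with only a handful of values the gap can be enough to move a score across a bin edge.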


Now that we have the complexity score, we can compare it to the expert reviews. We'll do so in a few ways:
* treating the algorithm as a 7th expert reviewer, and comparing the resulting ICC to the ICC of the expert reviewers alone
  * in addition, we'll do the same thing with random scores as the 7th reviewer, and with the mean of all reviewers as the 7th reviewer. Our algorithm should fall between those two.
* averaging the normalized scores that the reviewers gave to each form, and treating the result as a single "mega-reviewer" score. We'll compare this to the scores our algorithm gave the same forms, and look at the ICC there.

In [141]:
algo_scores = mean_per_form.copy(deep=True)
algo_scores[name_column] = "Algo"
algo_scores["z_rating"] = algo_scores["z_complexity_score"]
algo_scores['z_rating_binned'] = np.digitize(algo_scores['z_rating'], bins=z_score_bins) + 1
df_algo_as_reviewer = pd.concat([df, algo_scores])

rand_scores = mean_per_form.copy(deep=True)
rand_scores[name_column] = 'random'
rand_scores['z_rating'] = np.random.rand(len(rand_scores)) * 5 - 2.5
rand_scores['z_rating_binned'] = np.digitize(rand_scores['z_rating'], bins=z_score_bins) + 1
df_rand_as_reviewer = pd.concat([df, rand_scores])

avg_scores = mean_per_form.copy(deep=True)
avg_scores[name_column] = 'avg'
# z_rating already holds the mean of the reviewers' z-scores for each form, so it is used as-is
avg_scores['z_rating_binned'] = np.digitize(avg_scores['z_rating'], bins=z_score_bins) + 1
df_avg_as_reviewer = pd.concat([df, avg_scores])

algo_as_expert_results = pg.intraclass_corr(data=df_algo_as_reviewer, targets=form_column, raters=name_column, ratings="z_rating_binned", nan_policy='omit')
rand_as_expert_results = pg.intraclass_corr(data=df_rand_as_reviewer, targets=form_column, raters=name_column, ratings="z_rating_binned", nan_policy='omit')
avg_as_expert_results = pg.intraclass_corr(data=df_avg_as_reviewer, targets=form_column, raters=name_column, ratings="z_rating_binned", nan_policy='omit')

#print(df_algo_grader.groupby(name_column).mean())
print(f"---\nExperts only:\n {expert_ratings_results.set_index('Type').loc[['ICC1'], ['ICC', 'pval']].round(4)}")
print(f"---\nAlgorithm as expert:\n {algo_as_expert_results.set_index('Type').loc[['ICC1'], ['ICC', 'pval']].round(4)}")
print(f"---\nRandom as expert:\n {rand_as_expert_results.set_index('Type').loc[['ICC1'], ['ICC', 'pval']].round(4)}")
print(f"---\nAverage as expert:\n {avg_as_expert_results.set_index('Type').loc[['ICC1'], ['ICC', 'pval']].round(4)}")

---
Experts only:
          ICC    pval
Type                
ICC1  0.2675  0.0301
---
Algorithm as expert:
          ICC    pval
Type                
ICC1  0.3359  0.0055
---
Random as expert:
          ICC    pval
Type                
ICC1  0.2211  0.0345
---
Average as expert:
          ICC    pval
Type                
ICC1  0.2865  0.0128


In [142]:
tmp_df_algo = pd.DataFrame([], columns=["reviewer", "rating"])
tmp_df_algo["rating"] = mean_per_form["z_complexity_score"]
tmp_df_algo["reviewer"] = "algo"
tmp_df_algo["idx"] = tmp_df_algo.index
tmp_df_mega = pd.DataFrame([], columns=["reviewer", "rating"])
tmp_df_mega["rating"] = mean_per_form["z_rating"]
tmp_df_mega["reviewer"] = "expert"
tmp_df_mega["idx"] = tmp_df_mega.index
algo_vs_mega = pd.concat([tmp_df_algo, tmp_df_mega])
algo_vs_mega["rating_binned"] = np.digitize(algo_vs_mega["rating"], bins=z_score_bins) + 1

algo_vs_mega_results = pg.intraclass_corr(data=algo_vs_mega, targets="idx", raters="reviewer", ratings="rating_binned")

print(f"---\nAlgo and experts as a single unit:\n {algo_vs_mega_results.set_index('Type').loc[['ICC3'], ['ICC', 'pval']].round(4)}")

---
Algo and experts as a single unit:
          ICC    pval
Type                
ICC3  0.3151  0.0224
