In [1]:
import pandas as pd

from sklearn.metrics import cohen_kappa_score

# Assessing the Validity of the Valuation Exclusion

During the data extraction and coding step, we excluded attributes that *do not imply valuation*, i.e., attributes whose values do not indicate the quality of the activity. To assess the validity of this step, a second researcher performed the same task independently. This notebook calculates the agreement between the two raters.

In [2]:
file_name: str = './r3a-data-extraction.xlsx'

In [3]:
def calc_percentage_agreement(rating1: list, rating2: list) -> float:
    """Calculate the percentage agreement between two ordered lists of ratings.

    parameters:
        rating1 - list of values of one rater
        rating2 - list of values of another rater

    returns
        percentage agreement - the fraction of exact matches between the two ratings, between 0 and 1

    raises
        ValueError - if the two lists are not of equal length
    """

    if len(rating1) != len(rating2):
        raise ValueError(f'The two ratings to compare are of unequal length ({len(rating1)} vs. {len(rating2)}).')
    
    n: int = len(rating1)
    agreement: float = 0
    for (r1, r2) in zip(rating1, rating2):
        if r1 == r2:
            agreement += 1

    return agreement/n

def calc_bennetts_s_score(rating1: list, rating2: list, labels: list) -> float:
    """Calculate Bennett's S score, which represents the agreement between two raters given a set of possible labels.
    
    parameters:
        rating1 - list of values of one rater
        rating2 - list of values of another rater
        labels - list of potential values

    returns
        s score - Bennett's S score of agreement, between -1/(k-1) and 1 for k labels

    raises
        ValueError - if the two lists are not of equal length
    """
    k = len(labels)
    p = calc_percentage_agreement(rating1, rating2)
    return (k/(k-1)) * (p-(1/k))


## Comparison 1

### Data Loading

First, we load both sets of ratings from the Excel file of extractions.

In [4]:
rating1 = pd.read_excel(file_name, sheet_name='Data', usecols=['ID', 'Attribute Description', 'Val'])
rating2 = pd.read_excel(file_name, sheet_name='Valuation Overlap v1', usecols=['ID', 'Dependent Variable', 'No valuation'])

Then, we filter the ratings of the first researcher to contain only those that were also rated by the second researcher.

In [5]:
relevant_ids = rating2['ID'].values
relevant_dependent_variables = rating2['Dependent Variable'].values
rating1_relevant = rating1[(rating1['ID'].isin(relevant_ids)) & 
                           (rating1['Attribute Description'].isin(relevant_dependent_variables))]

In [6]:
rating1_relevant = rating1_relevant.drop_duplicates().set_index('ID')
rating2 = rating2.set_index('ID')

Finally, we merge the two data frames to obtain one data frame containing both ratings, now named `R1` and `R2` respectively.

In [7]:
rating1_relevant.rename(columns={'Val': 'R1'}, inplace=True)
rating2.drop(columns='Dependent Variable', inplace=True)
rating2.rename(columns={'No valuation': 'R2'}, inplace=True)
overlap1 = pd.concat([rating1_relevant, rating2], axis=1)
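
The merge relies on `pd.concat` with `axis=1` aligning rows on the `ID` index rather than on row order. A minimal sketch with hypothetical toy frames (the IDs and values below are invented for illustration and are not taken from the extraction sheet) shows this alignment, together with a simple check for IDs that lack a rating on either side:

```python
import pandas as pd

# Hypothetical toy frames standing in for rating1_relevant and rating2.
toy_r1 = pd.DataFrame({'ID': [1, 2, 3], 'R1': [True, False, True]}).set_index('ID')
toy_r2 = pd.DataFrame({'ID': [3, 1, 2], 'R2': [True, True, False]}).set_index('ID')

# With axis=1, concat aligns rows on the shared index (outer join by default),
# so the differing row order of the two frames does not matter.
toy_overlap = pd.concat([toy_r1, toy_r2], axis=1)

# An ID present in only one frame would produce a NaN in the other column.
missing = toy_overlap[toy_overlap.isna().any(axis=1)]
print(toy_overlap)
print(f'Rows with a missing rating: {len(missing)}')
```

A row with a missing rating can never compare equal in `calc_percentage_agreement` and would therefore count as a disagreement, so an empty `missing` frame is a reasonable precondition before calculating agreement.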

### Calculating Agreement

Next, we calculate the agreement between the ratings of both researchers by means of percentage agreement [1], Cohen's Kappa [2], and Bennett's S-score [3].

Percentage agreement is the simplest measure of inter-rater reliability, but it does not account for agreement caused by chance. Cohen's Kappa corrects for chance agreement by estimating the expected agreement from each rater's marginal label distribution as observed in the data. Bennett's S-score is a recommended alternative to Cohen's Kappa: it also corrects for chance agreement, but it assumes an even marginal distribution over the labels instead of estimating it from the data. We report all three measures for the sake of completeness.

[1] Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley.
[2] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1), 37-46.
[3] Bennett, E. M., Alpert, R., & Goldstein, A. C. (1954). Communications through limited-response questioning. Public Opinion Quarterly, 18(3), 303-308.
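
To make the difference between the two chance corrections concrete, the following sketch computes all three measures by hand for a small, hypothetical pair of ratings (the values are invented and unrelated to the study data). The variable names `p_o`, `p_e_kappa`, and `p_e_s` are chosen here for illustration only.

```python
# Hypothetical toy ratings with uneven marginals (4x True, 2x False per rater).
toy_r1 = [True, True, True, True, False, False]
toy_r2 = [True, True, True, False, False, True]

n = len(toy_r1)
labels = [True, False]

# Observed agreement: fraction of items on which both raters agree.
p_o = sum(a == b for a, b in zip(toy_r1, toy_r2)) / n

# Chance agreement for Cohen's Kappa: product of both raters' observed
# label frequencies, summed over all labels.
p_e_kappa = sum((toy_r1.count(label) / n) * (toy_r2.count(label) / n)
                for label in labels)

# Chance agreement for Bennett's S: every label is assumed equally likely.
p_e_s = 1 / len(labels)

kappa = (p_o - p_e_kappa) / (1 - p_e_kappa)
s_score = (p_o - p_e_s) / (1 - p_e_s)

print(f'percentage agreement: {p_o:.2%}')   # 66.67%
print(f"Cohen's Kappa:        {kappa:.2%}")  # 25.00%
print(f"Bennett's S:          {s_score:.2%}")  # 33.33%
```

Because both toy raters favor `True`, Cohen's Kappa expects more agreement by chance (about 55.6%) than Bennett's S (50%) and therefore yields the lower score.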

In [8]:
percentage_agreement = calc_percentage_agreement(overlap1['R1'], overlap1['R2'])
cohens_kappa = cohen_kappa_score(overlap1['R1'], overlap1['R2'], labels=[True, False])
s_score = calc_bennetts_s_score(overlap1['R1'], overlap1['R2'], labels=[True, False])

print(f"The two raters achieved a percentage agreement of {percentage_agreement:.2%}, a Cohen's Kappa agreement of {cohens_kappa:.2%}, and a Bennett's S-Score of {s_score:.2%}.")

The two raters achieved a percentage agreement of 66.67%, a Cohen's Kappa agreement of 33.33%, and a Bennett's S-Score of 33.33%.


## Comparison 2

A Cohen's Kappa of 33.33% indicates poor agreement. The two researchers therefore discussed the rating criteria once more and repeated the labeling task.

### Data Loading

The results of the second rating are stored in the sheet `Valuation Overlap v2`.

In [9]:
rating3 = pd.read_excel(file_name, sheet_name='Valuation Overlap v2', usecols=['ID', 'No valuation']).set_index('ID')

In [10]:
rating3.rename(columns={'No valuation': 'R2'}, inplace=True)
overlap2 = pd.concat([rating1_relevant, rating3], axis=1)

### Calculating Agreement

Again, we calculate the agreement of the two raters using the three metrics [1,2,3].

In [11]:
percentage_agreement = calc_percentage_agreement(overlap2['R1'], overlap2['R2'])
cohens_kappa = cohen_kappa_score(overlap2['R1'], overlap2['R2'], labels=[True, False])
s_score = calc_bennetts_s_score(overlap2['R1'], overlap2['R2'], labels=[True, False])

print(f"The two raters achieved a percentage agreement of {percentage_agreement:.2%}, a Cohen's Kappa agreement of {cohens_kappa:.2%}, and a Bennett's S-Score of {s_score:.2%}.")

The two raters achieved a percentage agreement of 91.67%, a Cohen's Kappa agreement of 83.33%, and a Bennett's S-Score of 83.33%.


The percentage agreement is high, but this metric is unreliable in general because it does not correct for chance agreement. Cohen's Kappa and Bennett's S-score are equal because of the even marginal distributions. Both values are now sufficiently high to indicate a common understanding of the subjective task.
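
The equality of the two scores can be derived in a few lines. The notation is introduced here for illustration: $p_o$ is the observed agreement, $p_e$ the chance agreement used by Cohen's Kappa, $q_{r,i}$ the fraction of items that rater $r$ assigned to label $i$, and $k$ the number of labels. The derivation assumes that at least one rater (here rater 1) has an exactly even marginal distribution.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \sum_{i=1}^{k} q_{1,i}\, q_{2,i}, \qquad S = \frac{k}{k-1}\left(p_o - \frac{1}{k}\right)
$$

$$
q_{1,i} = \frac{1}{k} \;\; \forall i \quad\Longrightarrow\quad p_e = \frac{1}{k}\sum_{i=1}^{k} q_{2,i} = \frac{1}{k} \quad\Longrightarrow\quad \kappa = \frac{p_o - \frac{1}{k}}{1 - \frac{1}{k}} = \frac{k}{k-1}\left(p_o - \frac{1}{k}\right) = S
$$

For the two labels used here ($k = 2$), both scores reduce to $2p_o - 1$, which is consistent with the reported values: $2 \cdot 0.9167 - 1 \approx 0.8333$.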