These notebooks make use of the CrowdTruth framework. I would like to thank the authors for the work that has allowed me to conduct my thesis research.

@article{CrowdTruth2,
  author    = {Anca Dumitrache and Oana Inel and Lora Aroyo and Benjamin Timmermans and Chris Welty},
  title     = {CrowdTruth 2.0: Quality Metrics for Crowdsourcing with Disagreement},
  year      = {2018},
  url       = {https://arxiv.org/abs/1808.06080},
}

The CrowdTruth package can be installed using: 
"pip install crowdtruth"

# Part 1: Importing the Google Forms data into Pandas DataFrames



We first import the Annotation Survey results into a Pandas DataFrame.

In [1]:
import pandas as pd
import crowdtruth
import csv
import os
from crowdtruth.configuration import DefaultConfig
import dateparser

general_df = pd.read_csv("data/CrowdTruth_Results.csv")


Afterwards, we split the DataFrame into separate DataFrames, one per group. Each of these is then written to its own CSV file, which can be used for future analysis.

In [2]:
grouped_connection_type  = general_df.groupby('connection_type')
grouped_connection_level = general_df.groupby('connection_level')

We first write one CSV file for each connection type that the participants self-identified with.

In [3]:
for group_name, group_df in grouped_connection_type:
    group_df.to_csv('data/Niches/' + f'{group_name}.csv', index=False)

Afterwards, we do the same for each of the connection levels that the participants self-identified with.

In [5]:
for group_name, group_df in grouped_connection_level:
    group_df.to_csv('data/Niches/level_' + f'{group_name}.csv', index=False)
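
Because each group name becomes part of a file name, a value containing a slash or another special character would produce an invalid or unexpected path. A minimal sketch of a sanitising step follows. The helper name `safe_filename` and the replacement rule are assumptions for illustration and are not part of the original notebook.

```python
import re

def safe_filename(group_name, prefix=""):
    """Build a CSV file name from a group label (hypothetical helper)."""
    # Replace every run of characters that is not a word character
    # (letter, digit, underscore), a dot, or a hyphen with one underscore.
    cleaned = re.sub(r"[^\w.-]+", "_", str(group_name).strip())
    return f"{prefix}{cleaned}.csv"

print(safe_filename("family/friend"))         # family_friend.csv
print(safe_filename("very close", "level_"))  # level_very_close.csv
```

In the loops above, `'data/Niches/' + safe_filename(group_name)` could take the place of the f-string if the group labels turn out to contain such characters.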

# Part 2: Preprocessing

The pre-processing configuration defines how to interpret the raw CSV file with the results. To do this, we need to define a configuration class. Some default configuration is inherited from DefaultConfig. We also define some additional attributes that are specific to the photograph annotation task from the annotation survey. Here are the most important ones:

- outputColumns: the columns that contain the answers from the participants.
- customPlatformColumns: to ensure that the framework functions properly, we need to define columns for the judgment ID, unit ID, worker ID, start time, and submit time (here judgmentId, unitId, workerId, startedAt, and submittedAt). This has already been done in the spreadsheet containing the Annotation Survey results.
- annotation_separator: the string that separates the individual participant annotations within the outputColumns.
- annotation_vector: a list of all possible annotation answers.
- open_ended_task: a boolean that indicates whether the task is open-ended, that is, whether the answer options are not all known beforehand. Participants were given the option to fill in terms that they deemed more relevant in the annotation task, which is why this variable has been set to True.
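
To make the roles of `annotation_separator`, `annotation_vector`, and `open_ended_task` concrete, the sketch below turns one answer string into a binary vector. It only illustrates the idea and is not CrowdTruth's internal implementation. The function name `to_vector`, the shortened vocabulary, and the free-text term "solidariteit" are assumptions.

```python
def to_vector(answer, vocabulary, separator=", ", open_ended=True):
    """Map one answer string to a {term: 0/1} vector (illustrative only)."""
    chosen = [term for term in answer.lower().split(separator) if term]
    vector = {term: 0 for term in vocabulary}
    for term in chosen:
        if term in vector:
            vector[term] = 1
        elif open_ended:
            # Open-ended task: terms outside the vocabulary are kept.
            vector[term] = 1
    return vector

vocabulary = ["protest", "opstand", "onrust"]  # shortened example vocabulary
vector = to_vector("Protest, onrust, solidariteit", vocabulary)
print(vector)
```

With `open_ended=False`, the unknown term would simply be dropped, which is the difference this flag is meant to capture in the sketch.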


The complete configuration class is declared below:

In [7]:
class TestConfig(DefaultConfig):
    
    inputColumns = ["link"]
    outputColumns = ["Answer"]
    customPlatformColumns = ["judgmentId", "unitId", "workerId", "startedAt", "submittedAt"]
    open_ended_task = True
    annotation_vector = ["emancipatie", "revolutie", "vrijheidsstrijd", "opstand","onrust" ,"gelijkheid" ,"protest" ,"bevrijding" ,"vandalisme" ,"rechtvaardigheid"
                         ,"verwoesting" ,"onafhankelijkheid" ,"verzet" ,"oproer" ,"staking", "anarchie", "opkomen", "ongeluk/niet met opzet", 
                         "opstandigheid", "omverwerping", "uitbraak", "dissidentie", "ongehoorzaamheid", "muiterij", "hulp", "ongeloof bij de mensen","orde herstellen",
                         "bemoeienis", "onderdrukken", "autoriteit", "oorlog", "(te late) damage control", "saamhorigheid", "droevigheid", "verlies", "heel triest", "oneerlijkheid",
                         "destructie", "woede ", "boosheid", "vechtlust ", "wanhoop", "eenheid"]
    annotation_separator = ", "


    def processJudgments(self, judgments):
        # pre-process output to match the values in annotation_vector
        for col in self.outputColumns:
            # transform to lowercase so answers match the lowercase terms above
            judgments[col] = judgments[col].apply(lambda x: str(x).lower())
        return judgments
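
Free-text answers often differ only in case, surrounding whitespace, or stray brackets. A sketch of a stricter normalisation that could be applied to each answer is shown below. `normalise_answer` is a hypothetical helper, and whether brackets actually occur in the survey export is an assumption.

```python
def normalise_answer(raw, separator=", "):
    """Lowercase an answer, drop square brackets, and trim each term."""
    text = str(raw).lower().replace("[", "").replace("]", "")
    # Split on the bare comma so that "a,b" and "a, b" are treated alike.
    terms = [term.strip() for term in text.split(",")]
    # Re-join the non-empty terms with the configured separator.
    return separator.join(term for term in terms if term)

print(normalise_answer("[Protest,  Onrust ]"))  # protest, onrust
```

Inside processJudgments this could be used as `judgments[col].apply(normalise_answer)`. Trimming would require the entries of annotation_vector to be trimmed in the same way, since "woede " and "vechtlust " currently carry a trailing space.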

# Part 3: Defining the functions to generate annotation/unit results
We then define one function that processes the input CSV files, each of which has been filtered on one niche group (connection_type or connection_level).
The function below returns the following metrics:

- The Unit Annotation Score (UAS), which shows the degree to which one annotation is chosen over the other annotations. This is shown per annotated photograph.
- The most frequently selected terms in the complete annotation task. As this was a sparse multiple-choice task with free input, this allowed us to show the most commonly selected terms across all ten photographs.
- The Unit Quality Score (UQS), which shows the overall participant agreement for every annotated photograph.
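
As an intuition for the two unit-level metrics, the sketch below computes simplified, unweighted versions: the share of participants who chose each term, and the mean pairwise cosine similarity between participant vectors. CrowdTruth additionally weights these by worker and annotation quality, so the numbers will differ from the framework's output. All function names and the toy votes are assumptions.

```python
from itertools import combinations
from math import sqrt

def simple_uas(worker_vectors, terms):
    """Share of participants who selected each term (unweighted)."""
    n = len(worker_vectors)
    return {t: sum(v.get(t, 0) for v in worker_vectors) / n for t in terms}

def cosine(a, b, terms):
    """Cosine similarity between two sparse {term: value} vectors."""
    dot = sum(a.get(t, 0) * b.get(t, 0) for t in terms)
    norm_a = sqrt(sum(a.get(t, 0) ** 2 for t in terms))
    norm_b = sqrt(sum(b.get(t, 0) ** 2 for t in terms))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def simple_uqs(worker_vectors, terms):
    """Mean pairwise cosine similarity between participants (unweighted)."""
    pairs = list(combinations(worker_vectors, 2))
    if not pairs:  # fewer than two participants: agreement is undefined
        return 0.0
    return sum(cosine(a, b, terms) for a, b in pairs) / len(pairs)

terms = ["protest", "opstand", "onrust"]
votes = [  # toy data: one dict per participant for a single photograph
    {"protest": 1, "onrust": 1},
    {"protest": 1},
    {"opstand": 1},
]
print(simple_uas(votes, terms))
print(simple_uqs(votes, terms))
```

In this toy case two of three participants chose "protest", so its simplified score is 2/3, while the low pairwise similarity reflects that the third participant shares no term with the others.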

In [8]:
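def calculate_results(file):
    # load the survey results using the task-specific configuration
    data, config = crowdtruth.load(
        file=file,
        config=TestConfig()
    )

    # run the CrowdTruth metrics
    results = crowdtruth.run(data, config)

    uas = results["units"]["unit_annotation_score"]  # per-unit annotation scores
    aqs = results["annotations"]                     # annotation-level results
    uqs = results["units"]["uqs"]                    # per-unit quality scores

    return uas, aqs, uqs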

We first do this for every connection level.

In [None]:

root_directory = 'data/Niches'


for root, _, files in os.walk(root_directory):
    for file_name in files:
        
        file_path = os.path.join(root, file_name)

        uas, aqs, uqs = calculate_results(file_path)
        aqs.to_csv("data/results/Grouped_results/AQS/AQS_" + f"{file_name}")
        uas.to_csv("data/results/Grouped_results/UAS/UAS_" + f"{file_name}")
        uqs.to_csv("data/results/Grouped_results/UQS/UQS_" + f"{file_name}")
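
`os.walk` yields every file under the root, including any non-CSV files, and `to_csv` fails when the target directory does not exist. A sketch of helpers that filter the input files and prepare the output paths is given below. `result_paths`, `csv_files`, and the example file name are hypothetical, and the directory layout mirrors the one used above.

```python
import tempfile
from pathlib import Path

def result_paths(file_name, base="data/results/Grouped_results"):
    """Return {metric: output path} for one input file, creating folders."""
    paths = {}
    for metric in ("AQS", "UAS", "UQS"):
        folder = Path(base) / metric
        folder.mkdir(parents=True, exist_ok=True)  # create if missing
        paths[metric] = folder / f"{metric}_{file_name}"
    return paths

def csv_files(root_directory):
    """Yield paths of CSV files below root_directory, in sorted order."""
    yield from sorted(Path(root_directory).rglob("*.csv"))

# Demonstration in a temporary directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    paths = result_paths("level_high.csv", base=tmp)
    created = sorted(p.parent.name for p in paths.values())
    (Path(tmp) / "notes.txt").write_text("ignored")
    (Path(tmp) / "group.csv").write_text("a,b\n")
    found = [p.name for p in csv_files(tmp)]
print(created, found)
```

Inside the loop, `uas.to_csv(paths["UAS"])` and its two siblings could then replace the string concatenation.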
        
        

We then do this for every connection type.

In [11]:

root_directory = 'data/results/Grouped_results/Connection_types'

# Use os.walk to recursively traverse the directory tree
for root, _, files in os.walk(root_directory):
    for file_name in files:
        # Get the full path of the file
        file_path = os.path.join(root, file_name)
        print(file_path)

        # Compute the metrics for this file; without this call the loop
        # would write the stale results of the previous cell for every file
        uas, aqs, uqs = calculate_results(file_path)
        aqs.to_csv("data/results/Grouped_results/AQS/AQS_" + f"{file_name}")
        uas.to_csv("data/results/Grouped_results/UAS/UAS_" + f"{file_name}")
        uqs.to_csv("data/results/Grouped_results/UQS/UQS_" + f"{file_name}")

# Part 4: Generating Worker results

In [171]:

data, config = crowdtruth.load(
    file = 'data/CrowdTruth_Results.csv',
    config = TestConfig()
    )

results = crowdtruth.run(data, config)


In [174]:
results["workers"].to_csv('data/results/workers/results.csv')
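
The exported worker table can then be inspected, for example to list the participants with the lowest quality score. The sketch below assumes that the CSV has a `worker` column and a `wqs` column holding the worker quality score. Both column names and the sample rows are assumptions and should be checked against the actual file.

```python
import csv
import io

def lowest_quality(rows, n=2, score_column="wqs"):
    """Return the n rows with the lowest score, lowest first."""
    return sorted(rows, key=lambda row: float(row[score_column]))[:n]

# Made-up sample standing in for data/results/workers/results.csv
sample = io.StringIO("worker,wqs\nw1,0.62\nw2,0.18\nw3,0.45\n")
rows = list(csv.DictReader(sample))
for row in lowest_quality(rows):
    print(row["worker"], row["wqs"])
```

To run this on the real export, `open('data/results/workers/results.csv', newline='')` would replace the in-memory sample.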
