# Measuring Complex Constructs in Large-Scale Text With Computational Social Mixed Methods

## What are Computational Social Mixed Methods?

Computational Social Science (CSS) has offered scientists a new lens on sociological and psychological phenomena, especially by making large datasets that reflect raw human behavior accessible. A common task when processing such datasets is to quantify a latent concept in text (e.g., hate speech), scale up the classification with machine learning, and use the resulting labels or scores in statistical analysis for inference or prediction. We use the term <b>computational social mixed methods (CSMM)</b> because such pipelines combine techniques from the social and computational sciences, as well as qualitative and quantitative approaches. Because CSMM combine multiple methodologies from fields that were unrelated only a few years ago, executing them well is challenging.

## Learning objectives

One instance of CSMM pipelines spans data annotation, machine learning classification, and statistical analysis.

By the end of this tutorial, you will be able to

1. implement the major steps of data annotation, machine learning classification, and statistical analysis.
2. understand how synergies between social science and computational methods can emerge in CSMM pipelines.
3. overcome common challenges of CSMM pipelines.

## Target audience

This tutorial is aimed at both computational and social science scholars with at least basic knowledge of Python. Familiarity with the common machine learning libraries scikit-learn, scipy, transformers, datasets, and torch is an advantage. It also helps to know basic machine learning terminology (such as train-test split, or common evaluation metrics such as the F1 score). If you first need to familiarize yourself with those concepts, please refer to the accompanying publication (Herderich et al., 2026; see below).
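As a quick refresher on the two terms mentioned above, the following sketch shows a train-test split and the F1 score in plain Python. It is only an illustration: `split_train_test` and `binary_f1` are hypothetical helper names, the labels are made up, and in practice one would typically use `train_test_split` from `sklearn.model_selection` and `f1_score` from `sklearn.metrics` instead.

```python
import random


def split_train_test(items, test_fraction=0.25, seed=0):
    """Shuffle a copy of `items` and split it into (train, test)."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def binary_f1(y_true, y_pred, positive=1):
    """F1 score: harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0                     # no true positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up gold labels and classifier predictions (1 = construct present).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

train_ids, test_ids = split_train_test(range(8))
print(len(train_ids), len(test_ids))   # 6 training items, 2 test items
print(binary_f1(y_true, y_pred))       # precision = recall = 0.75
```

Holding out the test items before training is what makes the F1 score an estimate of performance on unseen text rather than on memorized examples.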

## Structure of the tutorial

This tutorial provides the accompanying code for the publication: Herderich, A., Lasser, J., Galesic, M., Aroyehun, S., David, G., & Garland, J. (2026). Measuring complex constructs in large-scale text with computational social mixed methods. <i>BRM, under review</i>.

The figure below illustrates the workflow of a CSMM pipeline. Not all steps in the pipeline lend themselves equally well to illustration with code. Therefore, we focus on the following aspects:

  - In <b>data annotation</b>, we cover calculating interrater reliability, sampling training data to be annotated from a larger corpus, and Iterative Annotation as a strategy against unbalanced training data.
  - In <b>model training</b>, we elaborate on LLMs versus traditional transformers as classifiers, show how to measure model performance, and cover n-fold cross-validation, masked language modeling, hyperparameter tuning, and a training strategy called Training on Confident Examples.
  - In <b>statistical analysis</b>, we explain how to account for classifier inaccuracies in group comparisons by empirically determining the classifier thresholds that assign samples to groups, and how to execute bootstrapping based on a classifier's false positive and false negative rates.
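
To give a flavor of the data annotation step listed above, here is a minimal sketch of one interrater reliability coefficient, Cohen's kappa, for two annotators. This is an assumption made for illustration: this overview does not state which coefficient the notebooks report, the function name `cohens_kappa` and the example labels are hypothetical, and the code is plain Python rather than the `irrCAC` package installed below.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty item set.")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if both raters labeled independently,
    # each following their own label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Made-up binary annotations (1 = construct present) for ten texts.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

# Observed agreement is 0.8, expected agreement is 0.52.
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Raw percent agreement would be 80% here, but kappa is noticeably lower because part of that agreement is expected by chance given how often each annotator assigns each label.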

The parts marked in blue are original solutions to the respective challenges. For a broader literature review on solutions for all steps of the pipeline, we kindly refer to the publication.

<img src="https://raw.githubusercontent.com/Hai-Lina/computational-social-mixed-methods-pipelines/GESIS-methods-hub/figures/CSMM-pipeline.png" width="400">

## Setting up the computational environment

The following Python packages are required:

In [None]:
! pip install numpy==1.26.4 pandas==2.3.3 matplotlib==3.10.8 joblib==1.5.3 tqdm==4.67.3
! pip install scikit-learn==1.8.0 scipy==1.12.0 transformers==5.2.0 datasets==4.4.1 torch==2.10.0 evaluate==0.4.6
! pip install irrCAC==0.4.4
! pip install openai==2.24.0

You can also build the environment with the respective <i>environment.yml</i> as described in the GitHub README.
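
To confirm that the installed versions match the pins above, a small check along the following lines may help. It is a sketch and not part of the tutorial code: `check_versions` and `PINNED` are hypothetical names, and only a subset of the pinned packages is listed for brevity.

```python
from importlib.metadata import PackageNotFoundError, version

# Subset of the pins from the install commands above (extend as needed).
PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.3.3",
    "scikit-learn": "1.8.0",
    "transformers": "5.2.0",
    "torch": "2.10.0",
}


def check_versions(pinned):
    """Return {package: (wanted, found, matches)}; found is None if missing."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = version(name)      # installed distribution version
        except PackageNotFoundError:
            found = None               # package is not installed
        report[name] = (wanted, found, found == wanted)
    return report


for name, (wanted, found, ok) in check_versions(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: wanted {wanted}, found {found or 'not installed'} [{status}]")
```

A mismatch does not necessarily break the notebooks, but matching the pins is the simplest way to reproduce the reported behavior.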

## Duration

Reading through the tutorial (explanations, code annotations) takes around 60 minutes. The notebooks can be run on a regular laptop with or without a GPU. With a GPU, the runtime is around 6 minutes; without a GPU, it is around 20 minutes. To work through the code together with the publication, we recommend allocating two working days.

## Social Science Use Case(s)

We illustrate this method with a working example on the effectiveness of counter speech against hate in online political discussions, published in Lasser, J., Herderich, A., et al. (2025). Collective moderation of hate, toxicity, and extremity in online discussions. <i>PNAS Nexus, 4</i>(11), pgaf369. There we investigated, for example, how counter speech is used in natural conversations and whether different types of strategies (e.g., providing facts or raising opinions) have different effects on conversational outcomes (e.g., prevalence of hate speech).

## References

Herderich, A., Lasser, J., Galesic, M., Aroyehun, S., David, G., & Garland, J. (2026). Measuring complex constructs in large-scale text with computational social mixed methods. <i>BRM, under review</i>.

Lasser, J., Herderich, A., Garland, J., Aroyehun, S., David, G., & Galesic, M. (2025). Collective moderation of hate, toxicity, and extremity in online discussions. <i>PNAS Nexus, 4</i>(11), pgaf369. https://doi.org/10.1093/pnasnexus/pgaf369

<b>To continue with the tutorial, please refer to notebook "1_data_annotation".</b>