LemengLiang/SSDataBench-Data

SSDataBench-Data

Overview

Large Language Models (LLMs) hold great promise for generating social science data, potentially expanding the methodological toolkit of quantitative social research. Prior studies have primarily focused on individual-level predictability or behavioral plausibility of LLM-generated data. We propose a new framework for assessing the validity of LLM-generated data by returning to the foundational principles of survey research in the social sciences. Just as surveys based on representative samples yield statistics that approximate the corresponding statistical moments of the target population, assessments should center on the ability of LLM-generated data to reproduce real-world, population-level statistical patterns.

We introduce SSDataBench, the first systematic benchmark designed to evaluate population-level statistical realism in LLM-generated social science data. The benchmark assesses five types of statistical patterns central to social research: univariate distributions, bivariate associations, multivariate outcome predictions, life event sequence distributions, and associations between life event sequences and covariates. We illustrate SSDataBench using four longitudinal datasets and three cross-sectional datasets spanning six major social domains: demographics, socioeconomic status, marriage, health, abilities, and attitudes. Our study reveals systematic representational limitations in current LLMs, manifested in a pronounced tendency to compress real-world heterogeneity into simplified topological structures.
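To make the first pattern type concrete, the sketch below compares the univariate distribution of one categorical variable in real and LLM-generated data using total variation distance. This is only an illustration under stated assumptions: the metric, the function names (`category_shares`, `total_variation_distance`), and the toy education data are hypothetical and are not taken from the SSDataBench code or paper.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def category_shares(values: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Return the share of observations falling in each category."""
    if not values:
        raise ValueError("values must be non-empty")
    counts = Counter(values)
    total = len(values)
    return {category: n / total for category, n in counts.items()}


def total_variation_distance(
    real: Sequence[Hashable], generated: Sequence[Hashable]
) -> float:
    """Half the sum of absolute differences in category shares.

    0.0 means the two marginal distributions are identical;
    1.0 means they share no categories at all.
    """
    p = category_shares(real)
    q = category_shares(generated)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


# Hypothetical toy data (not from any dataset used in the study).
real_education = ["primary"] * 30 + ["secondary"] * 50 + ["tertiary"] * 20
generated_education = ["primary"] * 10 + ["secondary"] * 70 + ["tertiary"] * 20

tvd = total_variation_distance(real_education, generated_education)
print(f"Total variation distance: {tvd:.2f}")
```

In this toy case the generated data over-concentrates on the modal category, which is one simple way reduced heterogeneity could show up in a univariate comparison.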

This repository provides the data processing code used in this study, along with the limited set of processed datasets that we are permitted to share. The goal is to facilitate transparency and replication while respecting the data use agreements of the original sources. Researchers can use the code in this repository to replicate our workflow or to apply the same procedures to the original microdata obtained directly from the data providers.

Repo Contents
