Digicomlab/Comparative-Media-Dataset

Comparative Media Dataset

The analysis of textual content (such as news articles, online reviews, corporate press releases, and parliamentary questions) is at the heart of communication science. Traditionally, news articles are accessed through subscription-based databases such as LexisNexis. Yet these databases often prohibit automated batch downloading, which makes obtaining media content at scale a tedious task. Harvesting articles directly from news outlets is possible, but maintaining a web scraper for each individual outlet takes a great deal of time and effort.

Recent efforts have curated large-scale comparative news datasets, such as INCA (Trilling et al., 2018) and the Comparative Agendas Project. However, these projects tend to focus on content produced in Europe and the US, which seriously limits our ability to conduct comparative research, especially on cases in non-Western contexts (Baden et al., 2022). Because collecting news content from multiple countries requires so much effort, large-scale comparative media research that goes beyond Europe and the US is rare. A database that gives communication scholars access to newspapers representative of peoples and cultures worldwide would therefore be a crucial contribution if we want all people to benefit from our research field. As large datasets are increasingly analyzed with computational methods that rely on crowd-coded annotations, the goal of this project is to create an annotated database that enables the comparative analysis of news content across multiple languages and countries.

The Comparative Media Dataset consists of articles from the top outlets of the 90 countries participating in the Joint European Values Survey/World Values Survey 2017-2022 Dataset. To comply with the restrictions on disseminating copyrighted material, we will use an approach similar to the standard practice for sharing Twitter data: the dataset will not contain the actual content of the articles, but links to Common Crawl, an open-access repository of web crawl data. A software package will be developed to extract the texts from the Common Crawl data files.
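The project's own extraction package is not described here, so the following is only a minimal sketch of how a link into Common Crawl could be resolved to article content. It assumes each link is stored as a WARC file path plus the byte offset and length of one compressed record; the `RecordPointer` class, its field names, and the helper functions are hypothetical and are not part of the dataset or its software. The sketch relies on Common Crawl's public HTTP endpoint (`https://data.commoncrawl.org/`), on HTTP range requests, and on each WARC record being stored as its own gzip member.

```python
import gzip
import urllib.request
from dataclasses import dataclass

# Public HTTP endpoint that serves Common Crawl data files.
CC_BASE_URL = "https://data.commoncrawl.org/"


@dataclass(frozen=True)
class RecordPointer:
    """Hypothetical pointer to one record inside a Common Crawl WARC file."""
    filename: str  # path of the WARC file, relative to CC_BASE_URL
    offset: int    # byte offset where the compressed record starts
    length: int    # byte length of the compressed record


def build_request(pointer: RecordPointer) -> urllib.request.Request:
    """Build a range request that downloads only the pointed-to record."""
    last_byte = pointer.offset + pointer.length - 1  # Range end is inclusive
    return urllib.request.Request(
        CC_BASE_URL + pointer.filename,
        headers={"Range": f"bytes={pointer.offset}-{last_byte}"},
    )


def split_record(compressed: bytes) -> tuple[bytes, bytes, bytes]:
    """Decompress one WARC response record into its three parts.

    Returns (WARC headers, HTTP headers, HTTP body). The body still
    carries the record's trailing blank lines.
    """
    raw = gzip.decompress(compressed)
    warc_headers, _, rest = raw.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, body


def fetch_body(pointer: RecordPointer) -> bytes:
    """Download one record over the network and return its HTTP body."""
    with urllib.request.urlopen(build_request(pointer)) as response:
        return split_record(response.read())[2]
```

Under these assumptions, `fetch_body(pointer)` returns the raw HTML of one archived page; turning that HTML into clean article text would need a further extraction step that is not shown here.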

Timeline

  • 1 October 2023: Invitation for contributions on annotation variables
  • 15 November 2023: Completion of crowd annotation
  • 15 January 2024 (tentative): Release of dataset

Team Members

Investigators

  • Justin Chun-ting Ho
  • Marthe Möller
  • Joanna Strycharz
  • Anne Kroon

Student Assistants

  • Huo Qiru
  • Zhu Dongdong

Funding

This project is funded by the Amsterdam School of Communication Research's Mid-Size Collaborative Research Grant.
