An overview and exploration of the concept of missing datasets.
Latest commit 23f89c4 Feb 15, 2017 @MimiOnuoha committed on GitHub Merge pull request #3 from Sparshith/patch-1
Fixing broken link

On Missing Data Sets

This repo will be periodically updated with more information, links, and topics. Most recent update: 08/15/16.


What is a Missing Data Set?

"Missing data sets" are my term for the blank spots that exist in spaces that are otherwise data-saturated. My interest in them stems from the observation that within many spaces where large amounts of data are collected, there are often empty spaces where no data live. Unsurprisingly, this lack of data typically correlates with issues affecting those who are most vulnerable in that context.

The word "missing" is inherently normative: it implies both a lack and an ought. Something does not exist, but it should. That which should be somewhere is not in its expected place; an established system is disrupted by distinct absence. Just because some type of data doesn't exist doesn't mean it's missing, and the idea of missing data sets is inextricably tied to a more expansive climate of inevitable and routine data collection.

Why Do They Matter?

That which we ignore reveals more than what we give our attention to. It’s in these things that we find cultural and colloquial hints of what is deemed important. Spots that we've left blank reveal our hidden social biases and indifferences.

Why Are They Missing?

There are a number of reasons why a data set that seems like it should exist might not, and they are all tied to the quiet complications inherent in data collection. Below are four reasons, with accompanying real-world examples.

  1. Those who have the resources to collect data lack the incentive to.

    Police brutality towards civilians provides a powerful example. Though policing and crime are among the most data-driven areas of public policy, there has traditionally been little standardized, rigorous collection of data about police brutality.

    The current political and cultural climate has made this issue one of public discussion. Public interest campaigns like Fatal Encounters and the Guardian’s The Counted have helped fill that void. But even for these individuals and organizations, the work is difficult and time-consuming. The group that would make the most sense to monitor this issue (the law enforcement agents who create the data set in the first place) has no incentive to actually gather such data, which could prove incriminating.

  2. The data to be collected resist simple quantification (corollary: we prioritize collecting things that fit our modes of collection).

    The defining tension of data collection is the struggle of taking a messy, organic world and defining it in formats that are neat, clean, and structured.

    Some things are difficult to collect and quantify by nature of their structure. We don't know how much US currency is outside of our borders. There's no incentive for other countries to monitor US currency within their countries, and the very nature of cash and the anonymity it affords makes it difficult to track.

    But then there are other subjects that resist quantification entirely. Things like emotions are hard to quantify (at this time, at least). Institutional racism is subtle and deniable; it reveals itself more in effects than in acts. Not all things are easily quantifiable, and at times the very desire to render the world more abstract, trackable, and machine-readable is an idea that itself deserves questioning.

  3. The act of collection involves more work than the benefit the presence of the data is perceived to give.

    Sexual assault and harassment are woefully underreported. While there are many reasons for this, a major one is that in many cases the act of reporting sexual assault is an intensive, painful, and difficult process. For some, the benefit of reporting isn't perceived to be equal to or greater than the cost of the process.

  4. There are advantages to nonexistence.

    To collect, record, and archive aspects of the world is an intentional act. There are situations in which it can be advantageous for a group to remain outside of the oft-narrow bounds of collection. In short, sometimes a missing data set can function as a form of protection.

Below is an ever-expanding list of missing datasets. Contributions are extra welcome.

An Incomplete List of Missing Data Sets

This list will always be incomplete, and is designed to be illustrative rather than comprehensive.

  • Civilians killed in encounters with police or law enforcement agencies
  • Sales and prices in the art world (and relationships between artists and gallerists)
  • People excluded from public housing because of criminal records
  • Trans people killed or injured in instances of hate crime
  • Poverty and employment statistics that include people who are behind bars
  • Muslim mosques/communities surveilled by the FBI/CIA
  • Mobility for older adults with physical disabilities or cognitive impairments
  • LGBT older adults discriminated against in housing
  • Undocumented immigrants currently incarcerated and/or underpaid
  • Undocumented immigrants for whom prosecutorial discretion has been used to justify release or general punishment
  • Measurements for global web users that take into account shared devices and VPNs
  • True measures around how often sexual harassment happens in the workplace
  • Firm statistics on how often police arrest women for making false rape reports
  • Caucasian children adopted by parents of color
  • Total number of local and state police departments using stingray phone trackers (IMSI-catchers)

Responses & Hypotheses

As part of my Data & Society fellowship, I'm working on a number of projects that aim to consider possible responses to missing data sets. This will be updated with documentation of those projects as they are published.

  • Data won't solve all problems. Data are useful for informing a debate, increasing knowledge, shaping a conversation, and providing context. Data can reveal trends and show how things have changed over time. But having data isn't enough to solve all problems (just because we now know how many people are killed in moments of police brutality doesn't mean that police brutality has ended).

  • Collective action is a strategy for resistance. My hypothesis is that one answer to these missing datasets lies in those who have a stake in the data cooperating to disrupt the structures preventing access to it, a la Jonah Bossewitch and Aram Sinnreich's sousveillance society model (see Resources folder for paper).

  • Lack of collection is also a strategy. This has been said before, but bears repeating. A tricky aspect of dealing with missing data sets is that they hint at larger problems, and the answer to those problems does not universally lie in collecting more data.



"The Point of Collection" - a piece I wrote for Data & Society's Points publication that expresses much of the conceptual background for this.

"The Detroit Geographic Expedition and Institute: A Case Study in Civic Mapping"" - Catherine D'Ignazio's excellent case study on the Detroit Geographic Society, which was a fascinating 1960s response to missing data.

"Where We Live and How We Die" - article for How We Get To Next that explores death and data, highlighting how difficult it is to talk about both as a result of missing data.


See Resources folder for articles and papers.

Wise Words from Others

Adam Obeng, Columbia University sociology PhD candidate who studies all things computational, sent me a thoughtful email in response to this repo. See below:

As a sociologist, I think this traces back to the rationalisation of governments and workplaces. If you're trying to make a systematic system, you need to remove discretion, which involves collecting lots and lots of data. Because you're trying to remove human discretion, the data are about people. In industry, this leads to Pearson's Law: "what is measured is improved". It's also why politicians grasp onto "hard" numbers, however dubious their source (e.g. opinion polls...).

See his complete response in the Resources folder.