# Reference: https://jupyterbook.org/interactive/hiding.html
# Use {hide, remove}-{input, output, cell} tags to hide content

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
from IPython.display import display

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.set_option('display.max_rows', 7)
pd.set_option('display.max_columns', 8)
# Use the full option name; the bare 'precision' key is ambiguous in newer pandas
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)

def display_df(df, rows=pd.options.display.max_rows,
               cols=pd.options.display.max_columns):
    with pd.option_context('display.max_rows', rows,
                           'display.max_columns', cols):
        display(df)

# Big Data Hubris 

When we have large amounts of administrative data or expansive digital traces, it can be tempting to treat them as more definitive than data collected from traditional, smaller research studies. We might even consider these large datasets as a replacement for scientific studies or as a census. This overreach is referred to as the “big data hubris” {cite}`lazer2014`. However, having data with a large scope does not mean that we can ignore foundational questions about how representative the data are, nor can we ignore issues with measurement, validity, dependency, and reliability (these terms are defined more precisely later). One well-known example is the Google Flu Trends tracking system.

(ex:gft)=
## Example: Google Flu Trends

The Google Flu Trends (GFT) tracking system used millions of digital traces from online queries for terms related to influenza to predict flu activity. GFT was built by compiling and analyzing data from vast numbers of online searches, data created unintentionally by the users of Google. While these data have proven valuable, they are not a substitute for more traditionally collected data, such as the Centers for Disease Control and Prevention (CDC) surveillance reports on influenza collected from laboratories across the United States. They are also not a substitute for simple models built from past CDC reports. Beginning with the 2011–2012 flu season, GFT overestimated the CDC numbers in 100 out of 108 weeks (see {numref}`fig:GFTseries`). Week after week, GFT's estimates of influenza cases came in too high, even though they were based on big data. In fact, a simple model that used 3-week-old CDC data and seasonal trends did a better job of predicting flu prevalence than GFT. (Google stopped publishing GFT in 2015.) GFT overlooked considerable information that could have been extracted by basic statistical methods. This does not mean that big data captured from online activity is useless. Researchers have shown that combining GFT data with lagged CDC data can substantially improve on both the GFT predictions and the CDC-based model {cite}`lazer2014,lazer2015`. It is often the case that combining different approaches leads to improvements over individual methods.


```{figure} GFTseries.png
---
name: fig:GFTseries
---
Google Flu Trends (GFT) weekly estimates for influenza-like illness. Over 108 weeks, GFT (magenta) overestimated the actual CDC reports (blue) 100 times. Also plotted are predictions from a model based on 3-week-old CDC data (blue) and seasonal trends (green).
```
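
To make the comparison concrete, here is a minimal sketch of the three kinds of predictors discussed above: a raw search-based signal, a simple model built from 3-week-old reports plus a seasonal term, and a combination of the two. The data are synthetic and invented purely for illustration (they are not CDC or GFT values), the names `truth`, `signal`, and `fit_predict` are hypothetical, and the models are fit and evaluated on the same weeks, which is a simplification rather than the procedure used in the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic weekly series (assumption: made-up numbers, not real CDC or GFT data)
n_weeks = 108
weeks = np.arange(n_weeks)
truth = 2.0 + 1.5 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 0.2, n_weeks)
# A "big data" signal that tracks the truth but is biased upward
signal = 1.3 * truth + 0.5 + rng.normal(0, 0.3, n_weeks)

lag = 3                  # predict week t from the report for week t - 3
y = truth[lag:]          # values we want to predict
lagged = truth[:-lag]    # 3-week-old "official" reports
season = np.column_stack([
    np.sin(2 * np.pi * weeks[lag:] / 52),
    np.cos(2 * np.pi * weeks[lag:] / 52),
])

def fit_predict(X, y):
    """Least-squares fit with an intercept; returns in-sample predictions."""
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ coef

def rmse(pred, y):
    return float(np.sqrt(np.mean((pred - y) ** 2)))

pred_lagged = fit_predict(np.column_stack([lagged, season]), y)
pred_combined = fit_predict(np.column_stack([lagged, season, signal[lag:]]), y)

rmse_raw = rmse(signal[lag:], y)          # use the biased signal as-is
rmse_lagged = rmse(pred_lagged, y)        # lagged reports + seasonality
rmse_combined = rmse(pred_combined, y)    # lagged reports + seasonality + signal
n_over = int(np.sum(signal > truth))      # weeks where the signal is too high

print(f"signal too high in {n_over} of {n_weeks} weeks")
print(f"RMSE raw={rmse_raw:.2f} lagged={rmse_lagged:.2f} combined={rmse_combined:.2f}")
```

Because the combined model contains the lagged model as a special case, its in-sample error can never be larger. A fair comparison would evaluate each model on weeks that were not used to fit it.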

## New Opportunities

The tremendous increase in openly available data has created new roles and opportunities. For example, a data journalist looks for interesting stories in ‘found’ data, much like how traditional beat reporters hunt for news stories. The data life cycle for the data journalist begins with a search for existing data that might contain an interesting story, rather than with a research question followed by a search for new or existing data that can address it.

Citizen science projects are another example. They engage many people (and instruments) in data collection. The pooled data are made available to the researchers who organize the project, and they are often also placed in public repositories for anyone to investigate further.

The availability of administrative data creates yet another opportunity. Researchers can link data collected from scientific studies with, say, government or medical records that were collected for administrative or business purposes; i.e., these data were collected for reasons that don’t directly stem from the question of interest. Such linkages can help researchers expand the possibilities of their analyses and cross-check the quality of their data. Administrative data can also include digital traces, such as records of activity on an online social network, and can be quite complex.
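
As a rough sketch of such a linkage, the following joins a small study table to an administrative table on a shared identifier, then checks which study participants have no administrative record and where the two sources disagree. The tables, the column names (`person_id`, `reported_visits`, `recorded_visits`), and the values are all hypothetical. Real linkages rarely have a clean shared key, so this shows the idea rather than a recipe.

```python
import pandas as pd

# Hypothetical study data: self-reported clinic visits
study = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "reported_visits": [2, 0, 5, 1],
})

# Hypothetical administrative records: visits logged for billing
admin = pd.DataFrame({
    "person_id": [2, 3, 4, 5],
    "recorded_visits": [0, 4, 1, 3],
})

# Keep every study participant; validate= raises an error if an ID is
# duplicated on either side; indicator= adds a _merge column showing
# where each row came from
linked = study.merge(admin, on="person_id", how="left",
                     validate="one_to_one", indicator=True)

# Participants with no administrative record
n_unmatched = int((linked["_merge"] == "left_only").sum())

# Among matched participants, cross-check the two sources
matched = linked[linked["_merge"] == "both"]
n_disagree = int((matched["reported_visits"] != matched["recorded_visits"]).sum())

print(linked)
print(f"{n_unmatched} unmatched, {n_disagree} disagreements among {len(matched)} matched")
```

Unmatched rows and disagreements are not necessarily errors in either source, but they point to where the scope of the two datasets differs.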

In all of these examples, a key factor to keep in mind is the scope of the data, including the connections between the data, the topic of investigation, and the question being asked. Understanding this framework can help us avoid answering the wrong question, applying inappropriate methods to the data, and overstating our findings. In the age of Big Data, we are tempted to sidestep these issues by collecting and analyzing ever more data. After all, a census gives us perfect information, so shouldn’t big data be nearly perfect? This is the topic of the next section.

Note: Later, in the {ref}`ch:quality` chapter, we consider the quality of the data collected in terms of how the information is coded, transformed, cleaned, and prepared for analysis.