# The Stages of the Lifecycle

{numref}`Figure %s <ds-lifecycle>` shows the data science lifecycle.
It's split into four stages: asking a question, obtaining data, 
understanding the data, and understanding the world.
We've made these stages very broad on purpose.
In our experience, the mechanics of the lifecycle change frequently.
Computer scientists and statisticians continue to build new software packages and programming languages
for working with data, and they develop new methodologies that are more specialized. 
Despite these changes, we've found that almost every data project follows the four stages in our lifecycle.
The first stage is to ask a question.

```{figure} figures/ds-lifecycle.svg
---
name: ds-lifecycle
---

This diagram of the data science lifecycle shows four high-level steps.
The arrows indicate how the steps can lead into one another.
```

*Ask a Question.* Asking good questions lies at the heart of data science, and recognizing
different kinds of questions guides us in our analyses.
For example, "How have house prices changed over time?" is very different from
"How will this new policy affect house prices?"
In this book, we focus on four broad categories of questions:
descriptive, exploratory, inferential, and predictive.
Narrowing down a broad question into one that can be answered with data is a key element of this first stage in the lifecycle. It can involve consulting the people participating in a study, figuring out how to measure something, and designing data collection protocols. 
A clear and focused research question helps us determine the data we need,
the patterns to look for, and how to interpret results.
These considerations help us plan the data collection phase of the lifecycle. 

*Obtain Data.* When data are expensive and hard to gather, and our aim is to generalize from the data to the world, we define precise protocols for collecting the data needed to answer the question. Other times, data are cheap and easily accessed.
This is especially true for online data sources.
For example, Twitter lets people quickly download millions of data
points[^twitter].
When data are plentiful, we can also start an analysis by obtaining data, exploring them, and then honing the research question.
In both situations, most data have missing values, weird values, or other anomalies that we need to account for. When we obtain data, we need to check their quality. Typically, we must also manipulate the data before we can analyze them more formally. We may need to modify the structure and clean and transform data values to prepare for analysis.
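
As a minimal sketch of what checking quality and cleaning can look like, the following code counts missing values and removes rows with impossible values. The records, field names, and cleaning rules here are hypothetical and chosen only for illustration; dropping rows is one of several reasonable ways to handle such problems.

```python
# Hypothetical records: each dict is one row; None marks a missing value.
records = [
    {"city": "Oakland", "price": 725000, "bedrooms": 3},
    {"city": "Berkeley", "price": None, "bedrooms": 2},
    {"city": "oakland ", "price": 810000, "bedrooms": -1},
    {"city": "Albany", "price": 990000, "bedrooms": 4},
]


def count_missing(rows):
    """Count the missing (None) values in each field."""
    counts = {}
    for row in rows:
        for field, value in row.items():
            counts[field] = counts.get(field, 0) + (value is None)
    return counts


def clean(rows):
    """Drop rows with missing or impossible values and tidy city names."""
    cleaned = []
    for row in rows:
        if any(value is None for value in row.values()):
            continue  # assumption: we drop incomplete rows rather than impute
        if row["bedrooms"] < 0:
            continue  # a negative bedroom count cannot be a real measurement
        cleaned.append({**row, "city": row["city"].strip().title()})
    return cleaned


missing = count_missing(records)
cleaned = clean(records)
print(missing)
print(cleaned)
```

In a real project, we would record how many rows each rule removes, since these choices can change the conclusions we draw later.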

[^twitter]: https://developer.twitter.com/en/docs/twitter-api

*Understand the Data.* After obtaining data, we want to carefully examine them, and *exploratory data analysis* is key. 
In our explorations we make plots to uncover interesting patterns and summarize the data visually. We also continue to look for problems with the data.
As we search for patterns and trends, we use summary statistics and build statistical models, like linear and logistic regression.
In our experience, this stage of the lifecycle is highly iterative.
Understanding the data can lead us back to any of the earlier stages in the data science lifecycle. We may find that we need to modify or redo our data cleaning and manipulation, acquire more data to supplement our analysis, or refine our research question given the limitations of the data. The descriptive and exploratory analyses that we carry out in this stage may adequately answer our question, or we may need to go on to the next stage in order to make generalizations beyond our data.
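
To make the idea of summary statistics and a simple model concrete, here is a small sketch that summarizes one variable and fits a least squares line by hand. The house sizes and prices are made-up values, and the function name `least_squares_line` is our own; in practice we would usually rely on a data analysis library and look at plots first.

```python
from statistics import mean, median, stdev

# Hypothetical data: house size (square feet) and sale price (thousands of dollars).
sizes = [850, 1100, 1400, 1600, 2000]
prices = [410, 495, 630, 690, 865]


def least_squares_line(x, y):
    """Return (intercept, slope) of the simple least squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Summary statistics for the outcome.
print(f"mean={mean(prices):.1f} median={median(prices)} sd={stdev(prices):.1f}")

# Fit the line and inspect residuals (observed minus fitted values).
intercept, slope = least_squares_line(sizes, prices)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(sizes, prices)]
print(f"intercept={intercept:.1f} slope={slope:.3f}")
```

Large or patterned residuals are one sign that we should return to an earlier stage, for example to recheck the data or rethink the model.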

*Understand the World.* Often our goals are purely exploratory, and the analysis ends at the "understanding the data" stage of the lifecycle. 
At other times, we aim to quantify how well the trends we find generalize beyond our data. 
We may want to use a model that we have fitted to our data to make inferences about the world or give predictions for future observations. 
To draw inferences from a sample to a population, we use
statistical techniques like A/B testing and confidence intervals.
And to make predictions for future observations, we create other kinds of interval estimates and use test/train splits of the data. 
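
As one illustration of quantifying uncertainty, the sketch below computes a percentile bootstrap interval for the difference in means between two groups in an A/B test. The bootstrap is our choice for this example, not the only way to build a confidence interval, and the measurements and the function name `bootstrap_diff_ci` are hypothetical. We also assume the units were randomly assigned to the two groups.

```python
import random
from statistics import mean

# Hypothetical outcome measurements from a randomized A/B test.
control = [12, 15, 11, 14, 13, 16, 12, 15, 14, 13]
treatment = [15, 18, 14, 17, 16, 19, 15, 17, 18, 16]


def bootstrap_diff_ci(a, b, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping group sizes fixed.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(mean(resample_b) - mean(resample_a))
    diffs.sort()
    # Take the middle `level` fraction of the sorted bootstrap differences.
    lower_idx = int(n_boot * (1 - level) / 2)
    upper_idx = int(n_boot * (1 + level) / 2) - 1
    return diffs[lower_idx], diffs[upper_idx]


observed = mean(treatment) - mean(control)
lower, upper = bootstrap_diff_ci(control, treatment)
print(f"observed difference: {observed:.2f}")
print(f"95% bootstrap interval: ({lower:.2f}, {upper:.2f})")
```

An interval that excludes zero suggests that the difference we observed is unlikely to be due to chance alone, under the assumptions above.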

:::{note}

Understanding the difference between exploration, inference, and prediction can be a challenge. 
We can easily slip into confusing a correlation found in data with a causal relationship. 
For example, an inferential analysis might observe a correlation in response to the question, "Do people who have a greater exposure to air pollution have a higher rate of lung disease?" In contrast, a causal question might ask, "Does giving an award to a Wikipedia contributor increase their productivity?" We typically cannot answer causal questions unless we have a randomized experiment (or an approximation of one). We point out these important distinctions throughout the book.

:::

Each chapter in this book tends to focus on one of these stages of the data science lifecycle. We map them out next.