import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

(ch:lifecycle)=
# The Data Science Lifecycle

Data science is a rapidly evolving field.
At the time of this writing, people are still trying to pin down exactly
what data science is, what data scientists do, and what skills data 
scientists should have.
What we do know, though, is that data science uses a combination of 
methods and principles from statistics and computer science to draw insights from data.
We use these insights to make all sorts of important decisions. 
Data science helps assess whether a vaccine works,
filter out spam from our email inboxes, calibrate air quality sensors, and advise public policy analysts on policy changes. 

This book covers fundamental principles and skills
that data scientists need to perform data analyses.
To help you keep track of the bigger picture, we've organized these topics
around a workflow that we call the *data science lifecycle*.
This chapter introduces the data science lifecycle.
It also provides a map for the rest of the book by showing you where 
each chapter fits into the lifecycle.
Unlike other books that focus on one part of the lifecycle, this book
covers the entire cycle from start to finish.
We explain theoretical concepts and show how they work in
practical case studies.
Throughout the book, we rely on real data from analyses by other data
scientists, not made-up data, so you can learn how to perform your own data analyses and draw sound
conclusions.

```{figure} figures/ds-lifecycle.svg
---
name: ds-lifecycle
---

This diagram of the data science lifecycle shows four high-level steps.
The arrows indicate how the steps can lead into one another.
```

{numref}`Figure %s <ds-lifecycle>` shows the data science lifecycle.
It's split into four stages: asking a question, obtaining data, 
understanding the data, and understanding the world.
We've made these stages very broad on purpose.
In our experience, the mechanics of a data analysis change frequently.
Computer scientists and statisticians continue to build new software packages and programming languages
for analysis, and they develop new analysis techniques that are more accurate and specialized. 
Despite these changes, we've found that almost every data analysis follows
the four steps in our lifecycle.
The first is to ask a question.

Asking questions lies at the heart of data science, and
different kinds of questions require different kinds of analyses.
For example, "How have house prices changed over time?" is very different from
"How will this new policy affect house prices?".
Understanding our research question helps us determine the data we need,
the patterns to look for,
and how to interpret results.
In this book, we focus on three broad categories of questions:
exploratory, inferential, and predictive.

Narrowing down a broad question into one that can be answered with data is a key element of this first stage in the lifecycle. It can involve consulting the people participating in the study, figuring out the measurements needed, and designing data collection protocols. These considerations help us plan the data collection phase of the lifecycle. 

When data are expensive and hard to gather, we define a precise research question first and then collect the data needed to answer the question. Other times, data are cheap and easily accessed.
This is especially true for online data sources.
For example, Twitter lets people quickly download millions of data
points.[^twitter]
When data are plentiful, we can also start an analysis by obtaining data, exploring it, and then honing the research question. 

[^twitter]: https://developer.twitter.com/en/docs/twitter-api

After obtaining data, we want to understand the data, and 
*exploratory data analysis* is a key part of this. 
In our explorations, we make plots to uncover interesting patterns and summarize the data visually. We also look for problems with the data.
Most real-world datasets have missing values, weird values, or other anomalies that we need to account for. We manipulate the data, cleaning and transforming it to prepare for analysis.
As we search for patterns and trends in the data, we use summary statistics and build statistical models, like linear and logistic regression.
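
As a rough illustration of these steps, here is a minimal sketch in `pandas`. The `sales` table, its column names, and its values are hypothetical and made up only for this example, and treating a negative price as missing is an assumption about how this toy dataset encodes unknown values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: yearly sale prices (in thousands of dollars).
# NaN marks a missing value; -1 is assumed to be a placeholder for "unknown".
sales = pd.DataFrame({
    "year": [2018, 2019, 2020, 2021, 2022, 2023],
    "price": [310.0, 325.0, np.nan, 360.0, -1.0, 402.0],
})

# Look for problems: count the missing values in each column.
n_missing = sales.isna().sum()

# Clean: treat non-positive prices as missing, then drop incomplete rows.
clean = sales.assign(price=sales["price"].where(sales["price"] > 0)).dropna()

# Summarize: basic summary statistics for the cleaned prices.
summary = clean["price"].describe()

# Model: fit a straight-line trend of price against years since the first record.
years_since_start = clean["year"] - clean["year"].min()
slope, intercept = np.polyfit(years_since_start, clean["price"], deg=1)

print(n_missing)
print(summary)
print(f"Estimated change in price per year: {slope:.1f}")
```

In a real analysis, we would decide how to handle anomalies like the placeholder value only after investigating how the data were collected.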

In our experience, this stage of the lifecycle is highly iterative.
Understanding the data can lead us back to any of the earlier stages in the data science lifecycle, or on to making generalizations about the world. 

Often our goals are purely exploratory, and our analysis can end at the "understanding the data" stage of the lifecycle. 
At other times, we aim to quantify how well the trends we find generalize beyond our data. We aim to make inferences about the world or give predictions of future observations. 
To draw inferences from a sample to a population, we use
statistical techniques like A/B testing and confidence intervals, and we make prediction intervals for future observations. 
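
To make the idea of a confidence interval concrete, the sketch below computes one for a sample mean. It uses the percentile bootstrap, which is one common approach among several and is chosen here only for illustration. The sample values and the helper name `bootstrap_ci` are hypothetical.

```python
import numpy as np

# Hypothetical sample of measurements drawn from some larger population.
sample = np.array([4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.5, 4.9])

def bootstrap_ci(data, n_boot=5000, level=0.95, seed=42):
    """Percentile bootstrap confidence interval for the mean of `data`."""
    rng = np.random.default_rng(seed)
    n = len(data)
    # Resample the data with replacement n_boot times, one resample per row.
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = data[idx].mean(axis=1)
    # Take the middle `level` fraction of the bootstrap means.
    alpha = 1 - level
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return lower, upper

lower, upper = bootstrap_ci(sample)
print(f"Sample mean: {sample.mean():.2f}")
print(f"Approximate 95% confidence interval: ({lower:.2f}, {upper:.2f})")
```

The interval describes the uncertainty in our estimate of the population mean. It is only trustworthy when the sample is representative of the population we care about.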

Each chapter in this book tends to focus on one of these stages. We map them out next. 