import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

(ch:lifecycle)=
# The Data Science Lifecycle

Data science is a rapidly evolving field.
At the time of this writing, people are still trying to pin down exactly
what data science is, what data scientists do, and what skills data 
scientists should have.
What we do know, though, is that data science uses a combination of 
methods and principles from statistics and computer science to work with and draw insights from data.
We use these insights to make all sorts of important decisions; 
data science helps assess whether a vaccine works,
filter out spam from email inboxes, calibrate air quality sensors, 
and advise analysts on policy changes. 

This book aims to prepare you for real-world data analysis. In theory, drawing conclusions from data is simple: load a data table, make a plot, and fit a model. In practice, it tends to be more complex. Data are messy. Data sources collect data in different formats. Data values go missing. A simple linear model is not always appropriate. How do we pick from many possible alternative models? And how do we generalize our conclusions outside our limited data sample?

This book covers fundamental principles and skills
that data scientists need to perform data analyses.
To help you keep track of the bigger picture, we've organized these topics
around a workflow that we call the *data science lifecycle*.
This chapter introduces the data science lifecycle.
It also provides a map for the rest of the book, showing you where 
each chapter fits into the lifecycle.
Unlike other books that focus on one part of the lifecycle, this book
covers the entire cycle from start to finish.
We explain theoretical concepts and show how they work in
practical case studies.
Throughout the book, we rely on real data from analyses by other data
scientists, not made-up data, so you can learn how to perform your own data acquisition, cleaning, exploration, and formal analyses and draw sound
conclusions.

```{figure} figures/ds-lifecycle.svg
---
name: ds-lifecycle
---

This diagram of the data science lifecycle shows four high-level steps.
The arrows indicate how the steps can lead into one another.
```

{numref}`Figure %s <ds-lifecycle>` shows the data science lifecycle.
It's split into four stages: asking a question, obtaining data, 
understanding the data, and understanding the world.
We've made these stages very broad on purpose.
In our experience, the mechanics of the lifecycle change frequently.
Computer scientists and statisticians continue to build new software packages and programming languages
for working with data, and they develop new analysis techniques that are more accurate and specialized. 
Despite these changes, we've found that almost every data project follows the four steps in our lifecycle.
The first is to ask a question.

*Ask a Question* The first step, asking questions, lies at the heart of data science, and
different kinds of questions require different kinds of analyses.
For example, "How have house prices changed over time?" is very different from
"How will this new policy affect house prices?"
Understanding our research question helps us determine the data we need,
the patterns to look for,
and how to interpret results.
In this book, we focus on three broad categories of questions:
exploratory, inferential, and predictive.

Narrowing down a broad question into one that can be answered with data is a key element of this first stage in the lifecycle. It can involve consulting the people participating in a study, figuring out how to measure something, and designing data collection protocols. These considerations help us plan the data collection phase of the lifecycle. 

*Obtain Data* When data are expensive and hard to gather, we aim to define a precise research question first and then collect the data needed to answer the question. Other times, data are cheap and easily accessed.
This is especially true for online data sources.
For example, Twitter lets people quickly download millions of data
points.[^twitter]
When data are plentiful, we can also start an analysis by obtaining data, exploring it, and then honing the research question.

Most datasets have missing values, weird values, or other anomalies that we need to account for. When we obtain data, we need to check its quality. We typically must also manipulate the data, cleaning and transforming it to prepare for analysis.
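As a rough illustration of what a quality check can look like, here is a minimal sketch that counts missing and implausible values and then filters them out. The sensor readings, the column name `pm25`, the plausible range, and the helper names `check_quality` and `clean` are all made up for this example. It uses only the Python standard library so that it is self-contained; a real analysis would more likely work with a dataframe library.

```python
# Hypothetical air quality readings, invented for illustration.
# None marks a missing value; -999.0 stands in for a "no reading" code.
readings = [
    {"sensor": "A", "pm25": 12.1},
    {"sensor": "A", "pm25": None},
    {"sensor": "B", "pm25": -999.0},
    {"sensor": "B", "pm25": 8.4},
    {"sensor": "C", "pm25": 15.0},
]


def check_quality(rows, column, low, high):
    """Count missing and out-of-range values in one column."""
    missing = sum(1 for r in rows if r[column] is None)
    out_of_range = sum(
        1 for r in rows
        if r[column] is not None and not (low <= r[column] <= high)
    )
    return {"rows": len(rows), "missing": missing, "out_of_range": out_of_range}


def clean(rows, column, low, high):
    """Keep only rows whose value is present and within the plausible range."""
    return [
        r for r in rows
        if r[column] is not None and low <= r[column] <= high
    ]


# The range 0 to 500 is an assumption chosen for this example.
report = check_quality(readings, "pm25", low=0.0, high=500.0)
cleaned = clean(readings, "pm25", low=0.0, high=500.0)
print(report)
print(len(cleaned), "rows kept")
```

Dropping rows is only one possible response to bad values. Depending on the question, we might instead flag them, impute them, or trace them back to the data source.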

[^twitter]: https://developer.twitter.com/en/docs/twitter-api

*Understand the Data* After obtaining data, we want to understand the data, and *exploratory data analysis* is a key part of this.
In our explorations, we make plots to uncover interesting patterns and summarize the data visually. We also look for problems with the data.
As we search for patterns and trends in the data, we use summary statistics and build statistical models, like linear and logistic regression.
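To make "summary statistics" and "a simple linear model" concrete, here is a small sketch that computes a mean and median and fits a least-squares line by hand. The `x` and `y` values are invented for illustration, and `fit_line` is a hypothetical helper, not a function from any library. It uses only the standard library.

```python
from statistics import mean, median

# Made-up paired measurements, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line for y on x."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations in x.
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


summary = {"mean_y": mean(y), "median_y": median(y)}
slope, intercept = fit_line(x, y)
print(summary)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
```

The fitted line summarizes a trend in this particular dataset. Whether that trend says anything beyond the data is a separate question.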

In our experience, this stage of the lifecycle is highly iterative.
Understanding the data can lead us back to any of the earlier stages in the data science lifecycle to, say, modify and redo our data cleaning and manipulation, or on to making generalizations about the world. 

*Understand the World* Often our goals are purely exploratory, and the analysis can end at the "understanding the data" stage of the lifecycle. 
At other times, we aim to quantify how well the trends we find generalize beyond our data, and we want to make inferences about the world or give predictions of future observations. 
To draw inferences from a sample to a population, we use
statistical techniques like A/B testing and confidence intervals.
And to make predictions for future observations, we create other interval estimates and use train-test splits of the data.
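The two ideas above can be sketched in a few lines. The example below shows one possible way to split data into train and test sets and to compute a normal-approximation confidence interval for a mean. The data are made up, the helper names `train_test_split` and `mean_confidence_interval` are hypothetical and written here from scratch, and the normal approximation is an assumption that is rough for small samples.

```python
import random
from statistics import NormalDist, mean, stdev


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    center = mean(values)
    half_width = z * stdev(values) / len(values) ** 0.5
    return center - half_width, center + half_width


# Made-up measurements, for illustration only.
values = [float(v) for v in range(1, 21)]

train, test = train_test_split(values)
low, high = mean_confidence_interval(values)
print(len(train), "training rows,", len(test), "test rows")
print(f"approximate 95% interval for the mean: ({low:.2f}, {high:.2f})")
```

Holding out the test set lets us check predictions on data the model never saw, while the interval expresses uncertainty about a population quantity estimated from a sample.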

:::{note}

Understanding the difference between exploration, inference, and prediction can be a challenge. And we can easily slip into confusing a correlation found in data with a causal relationship. An inferential analysis might observe a correlation in response to the question, "Do people who have a greater exposure to air pollution have a higher rate of lung disease?" By contrast, a causal question might ask, "Does giving an award to a Wikipedia contributor increase their productivity?" We typically cannot answer causal questions unless we have a randomized experiment (or can approximate one). We point out these important distinctions throughout the book.

:::

Each chapter in this book tends to focus on one of these stages. We map them out next. 