import sys
import os
if not any(path.endswith('textbook') for path in sys.path):
    sys.path.append(os.path.abspath('../../..'))
from textbook_utils import *

(sec:linear_simple)=
# Simple Linear Model

Modeling starts with an outcome variable $ y $ and one or more
predictor variables $ x $.
When $ y $ is a numeric variable like height or income, we say that we're
doing *regression*.
When $ y $ is a categorical variable like
the presidential candidate on a ballot, we say that we're
doing *classification*.
We assume that $ x $ is numeric.[^feat]

[^feat]: Later in this chapter
({numref}`Section %s <sec:linear_feature_eng>`) 
we'll see ways of working with categorical predictor variables.

When we use a simple linear model,
we assume that the outcome $ y $ depends linearly
on a single predictor $ x $ with some random measurement error $ \epsilon $:

$$
\begin{aligned}
y = \theta_0 + \theta_1 x + \epsilon
\end{aligned}
$$
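
To make the role of $ \epsilon $ concrete, here is a minimal sketch that simulates data from the equation above. The parameter values, the range of $ x $, and the normal distribution for the error are assumptions chosen only for illustration; they don't come from any real dataset.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
theta_0, theta_1 = 2.0, 0.5

rng = np.random.default_rng(42)
n = 200

x = rng.uniform(0, 10, size=n)         # predictor values
epsilon = rng.normal(0, 1, size=n)     # random error (assumed normal, sd = 1)
y = theta_0 + theta_1 * x + epsilon    # outcome under the simple linear model
```

A scatter plot of `y` against `x` would show points scattered around a line with intercept `theta_0` and slope `theta_1`; the scatter around the line comes from `epsilon`.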


The simple linear model $ f_{\theta}(x) $ doesn't try to model the measurement
error. Instead, it predicts $ y $ for a single value of $ x $:

$$
\begin{aligned}
f_{\theta}(x) = \theta_0 + \theta_1 x
\end{aligned}
$$


In the equations above, $ \theta_0 $ and $ \theta_1 $ are constants that
we call the *model parameters*.
Our goal in modeling is to use data to estimate $ \theta_0 $ and $ \theta_1 $.
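
As a sketch, the model $ f_{\theta}(x) $ can be written as a short Python function. The function name and the parameter values below are hypothetical; they only show how a prediction is computed once $ \theta_0 $ and $ \theta_1 $ are known.

```python
def simple_linear_model(x, theta_0, theta_1):
    """Return the prediction f_theta(x) = theta_0 + theta_1 * x."""
    return theta_0 + theta_1 * x

# Hypothetical parameters: intercept 2.0 and slope 0.5
prediction = simple_linear_model(4, theta_0=2.0, theta_1=0.5)
print(prediction)  # 4.0
```

Because the function uses only addition and multiplication, it also works when `x` is a NumPy array or a `pandas` Series, returning one prediction per value of `x`.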

:::{note}

We'll work with $ \epsilon $ more rigorously in future chapters.
For now, the important thing to remember is that the model depends on
$ x $. When $ \theta_1 $ is positive, bigger values of $ x $ make the
prediction $ f_{\theta}(x) $ bigger.

:::
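
The effect of the sign of $ \theta_1 $ can be checked numerically. In this small sketch the parameter values are made up for illustration: each one-unit increase in $ x $ changes the prediction by exactly $ \theta_1 $, so predictions rise when $ \theta_1 $ is positive and fall when it is negative.

```python
import numpy as np

def f(x, theta_0, theta_1):
    """Simple linear model prediction."""
    return theta_0 + theta_1 * x

x = np.array([0.0, 1.0, 2.0, 3.0])

# Hypothetical parameters with a positive and a negative slope
increasing = f(x, theta_0=1.0, theta_1=2.0)    # [1., 3., 5., 7.]
decreasing = f(x, theta_0=1.0, theta_1=-2.0)   # [1., -1., -3., -5.]

# Consecutive predictions differ by theta_1
print(np.diff(increasing))  # [2. 2. 2.]
print(np.diff(decreasing))  # [-2. -2. -2.]
```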

The simple linear model is useful because $ x $ and $ y $ can be any two
variables of interest. In this chapter, we'll use linear models to understand
what factors contribute to economic opportunity in the US.

## Data: Where is the Land of Opportunity?

In a famous study, the economist Raj Chetty and his colleagues
did a large-scale data analysis on 
economic mobility in the US {cite}`chettyWhere2014`.
The US was nicknamed "the land of opportunity" because people believed that
even poor people in the US could end up wealthy---an economist would say that
the US had high economic mobility. 
But Chetty had a hunch that some places in the US have much higher economic
mobility than others.
His analysis found this to be true.
Cities like San Jose, Washington DC, and Seattle have higher
mobility than places like Charlotte, Milwaukee, and Atlanta.
This means that, overall, people with low incomes in San Jose are more
likely to end up with high incomes than people with low incomes in Charlotte.

Chetty also used linear models to find out that social and economic
factors like segregation, income inequality, and local school systems 
are related to economic mobility.
To do this analysis, Chetty used government records to get the incomes of
everyone born in the US between 1980 and 1982.
He compared the parents' incomes to their children's incomes
in 2011-12, when the children were about 30 years old.
