# Chapter 3 - Linear regression
Mathematically, a linear relationship between `X` and `Y` can be written as 
$$Y \approx \beta_{0} + \beta_{1}X$$

$\beta_{0}$ and $\beta_{1}$ represent the intercept and slope. They are the model coefficients or parameters. Through regression we estimate these parameters. The estimates are represented as $\hat\beta_{0}$ and $\hat\beta_{1}$. Thus,

$$\hat y = \hat\beta_{0} + \hat\beta_{1}x$$

We estimate $\hat\beta_{0}$ and $\hat\beta_{1}$ using `least squares regression`. This technique finds the line that minimizes the sum of squared errors over all data points. Each point is weighted equally. If 

$$\hat y_{i} = \hat\beta_{0} + \hat\beta_{1}x_{i}$$
is the prediction for the `i`th pair of x, y values, then the error is calculated as 
$$e_{i} = y_{i} - \hat y_{i}$$
This error is also called a `residual`. Thus the **residual sum of squares (RSS)** over $n$ data points is calculated as
$$RSS = e_{1}^{2} + e_{2}^{2} + \dots + e_{n}^{2}$$
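
The fit and its RSS can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the data values are made up, the helper names `fit_simple_ols` and `residual_sum_of_squares` are hypothetical, and the slope and intercept use the standard closed-form least squares solution for a single predictor.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (beta0_hat, beta1_hat) that minimize the residual sum of squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # Closed-form least squares estimates for one predictor
    beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat

def residual_sum_of_squares(x, y, beta0_hat, beta1_hat):
    """RSS = e_1^2 + ... + e_n^2, where e_i = y_i - y_hat_i."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (beta0_hat + beta1_hat * x)
    return float(np.sum(residuals ** 2))

# Made-up data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b0, b1 = fit_simple_ols(x, y)
rss = residual_sum_of_squares(x, y, b0, b1)
print(f"intercept={b0:.3f}, slope={b1:.3f}, RSS={rss:.4f}")
```

Nudging either coefficient away from the fitted value and recomputing the RSS should give a larger number, which is what "minimizes the sum of squared errors" means in practice.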

Thus if the relationship between $X$ and $Y$ is approximately linear, then we can write:
$$Y = \beta_{0} + \beta_{1}X + \epsilon$$

where $\epsilon$ is the catch-all error that is introduced in forcing a linear fit for the model. The above equation is the population regression line. In reality, this is not known (unless you synthesize data using this model). In practice, you estimate the population regression line from a smaller sample of the data.

Using the **Central Limit Theorem**, we know that the average of a number of sample regression coefficients predicts the population coefficients pretty closely. A proof is available [here](verifying_clt_in_regression.html).
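
A small simulation can make this concrete. The sketch below assumes a synthetic population with made-up coefficients (`BETA0 = 2`, `BETA1 = 3`) and standard normal noise; the sample size, number of repetitions, and the helper name `sample_estimates` are likewise illustrative choices, not values from the text.

```python
import numpy as np

# Assumed population coefficients, chosen only for this illustration
BETA0, BETA1 = 2.0, 3.0

def sample_estimates(n_obs, rng):
    """Draw one sample from the synthetic population and fit a line to it."""
    x = rng.uniform(-2.0, 2.0, size=n_obs)
    eps = rng.normal(0.0, 1.0, size=n_obs)  # catch-all error term
    y = BETA0 + BETA1 * x + eps
    # np.polyfit returns the highest-degree coefficient first: (slope, intercept)
    beta1_hat, beta0_hat = np.polyfit(x, y, 1)
    return beta0_hat, beta1_hat

rng = np.random.default_rng(0)
estimates = np.array([sample_estimates(30, rng) for _ in range(2000)])

# Average the per-sample estimates column-wise
mean_beta0_hat, mean_beta1_hat = estimates.mean(axis=0)
print(f"mean intercept estimate: {mean_beta0_hat:.3f} (population {BETA0})")
print(f"mean slope estimate:     {mean_beta1_hat:.3f} (population {BETA1})")
```

Individual rows of `estimates` scatter around the population values, while their averages should land much closer to them.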

## Standard error
By averaging a number of estimates of $\hat\beta_{0}$ and $\hat\beta_{1}$, we are able to estimate the population coefficients in an **unbiased** manner. Averaging greatly reduces any systematic over- or underestimation that arises from choosing a small sample.

Now, how far will a single estimate of $\hat\beta_{0}$ be from the actual $\beta_{0}$? We can calculate it using **standard error**. 

To understand **standard error** let us consider the simple case of estimating population mean using a number of smaller samples. The standard error in a statistic (mean in this case) can be written as:

$$ SE(\hat\mu) = \frac{\sigma}{\sqrt{n}}$$
where $\hat\mu$ is the estimate for which we calculate the standard error (the sample mean in this case), $\sigma$ is the standard deviation of the **population**, and $n$ is the size of the sample you draw each time.

The SE of $\hat\mu$ is the same as the **standard deviation** of the sampling distribution of the sample means. Thus, the above equation ties together the **sample mean**, the **population mean**, and the **sample size** by describing how far off the sample mean is likely to be:

$$ SD(\hat\mu) = SE(\hat\mu) = \frac{\sigma}{\sqrt n}$$

Note that the SE gets smaller as the sample size grows. The value of SE is roughly the **average amount your $\hat\mu$ deviates from $\mu$**.
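
The relationship $SE(\hat\mu) = \sigma / \sqrt{n}$ can be checked empirically. The sketch below assumes a normally distributed population with made-up parameters (`MU = 10`, `SIGMA = 4`); the helper name `empirical_se` and the sample counts are illustrative choices.

```python
import numpy as np

# Assumed population mean and standard deviation, for illustration only
MU, SIGMA = 10.0, 4.0

def empirical_se(n, n_samples, rng):
    """SD of the sample means across n_samples samples of size n."""
    samples = rng.normal(MU, SIGMA, size=(n_samples, n))
    sample_means = samples.mean(axis=1)  # one mean per sample
    return sample_means.std(ddof=1)

rng = np.random.default_rng(42)
for n in (25, 100):
    observed = empirical_se(n, 20000, rng)
    expected = SIGMA / np.sqrt(n)
    print(f"n={n}: SD of sample means={observed:.3f}, sigma/sqrt(n)={expected:.3f}")
```

Quadrupling the sample size from 25 to 100 should roughly halve the spread of the sample means, as the $\sqrt{n}$ in the denominator suggests.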

In reality, you don't know $\sigma$ or $\mu$. Thus, rearranging the SE formula, we can calculate the SD of the population from $\sigma_{\hat\mu}$, the SD of the sampling distribution of the sample means, as:

$$\sigma = \sigma_{\hat\mu} \sqrt n $$

### Standard error in regression
