### Linear Least Squares
Correlation coefficients measure the strength and sign of a relationship, but not the slope. One way to estimate the slope is by using a **linear least squares fit**. 
- The **linear fit** is a line that models the relationship between the variables
- The **least squares** fit is the line that minimizes the mean squared error (MSE) between the line and the data

Say we have two sequences, `xs` and `ys`, holding the values of two variables. If there is a linear relationship between them, we expect each `ys[i]` to be approximately `intercept + slope * xs[i]`. The **residual** is the vertical deviation of each point from the line:

```
res = ys - (intercept + slope * xs)
```
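As a quick illustration, the sketch below computes residuals for a small made-up dataset and a hand-picked line. The arrays and parameter values are assumptions chosen for illustration, not values from the NSFG data.

```python
import numpy as np

# Hypothetical toy data (not from the NSFG dataset)
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.1, 3.9, 6.2, 7.8])

# Candidate parameters picked by hand for illustration
intercept, slope = 0.0, 2.0

# One residual per data point: observed y minus the y predicted by the line
res = ys - (intercept + slope * xs)
print(res)
```

Points above the line get positive residuals and points below it get negative ones.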

The residuals might be due to random factors like measurement error, or non-random factors that are unknown. For example, if we are trying to predict someone's weight based on their height, unknown factors might include diet, exercise, and body type.

If we get the parameters `intercept` and `slope` wrong, the residuals get bigger, so we want the parameters that make the residuals as small as possible. The most common way to do this is to minimize the sum of squared residuals, `sum(res**2)`. This is equivalent to minimizing the MSE, since the two differ only by the constant factor `n`. Why squares?
- Squaring will treat the positive and negative residuals the same
- Squaring gives more weight to large residuals, but not so much that the largest residual always dominates
- If the residuals are uncorrelated and normally distributed with mean 0 and constant, unknown variance, then the least squares fit is also the maximum likelihood estimator of `intercept` and `slope`.
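
To make the objective concrete, here is a minimal sketch that evaluates `sum(res**2)` for a few candidate slopes on made-up data. The helper name `sum_squared_residuals` and the data are hypothetical. The point is only that worse parameters give a larger sum.

```python
import numpy as np

def sum_squared_residuals(intercept, slope, xs, ys):
    """Return the sum of squared vertical deviations from the line."""
    res = ys - (intercept + slope * xs)
    return float(np.sum(res**2))

# Made-up data that roughly follows ys = 2 * xs
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.1, 3.9, 6.2, 7.8])

# Try a few candidate slopes with the intercept held at 0
sse = {slope: sum_squared_residuals(0.0, slope, xs, ys) for slope in (1.5, 2.0, 2.5)}
for slope, value in sse.items():
    print(slope, round(value, 2))
```

Of the three candidates, the slope of 2.0 gives by far the smallest sum, and moving away from it in either direction increases the sum.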

Let's see this in action. 

```
import pandas as pd
import numpy as np

def Covariance(xSample, ySample):
    """Sample covariance with the n - 1 denominator, matching np.var(..., ddof=1)."""
    xMean = xSample.mean()
    yMean = ySample.mean()
    return np.dot(xSample - xMean, ySample - yMean) / (xSample.count() - 1)

df = pd.read_pickle("nsfg_data.pkl")
live = df[df["outcome"] == 1]
# Drop rows that are missing either variable
noNan = live.dropna(subset=["agepreg", "totalwgt_lb"])
ages = noNan["agepreg"]
birthWgt = noNan["totalwgt_lb"]

xVar = np.var(ages, ddof=1)
xMean = np.mean(ages)
yMean = np.mean(birthWgt)

# Slope and intercept of the least squares line for these two variables
slope = Covariance(ages, birthWgt) / xVar  # both use n - 1, so the denominators cancel
intercept = yMean - slope * xMean

print(f"{slope:.4f} {intercept:.2f}")
```

```
0.0175 6.83
```
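
The formulas used here (covariance over variance for the slope, then `yMean - slope * xMean` for the intercept) can be sanity-checked against NumPy's `np.polyfit`. The sketch below does so on synthetic data, because the NSFG pickle file may not be available. The true parameters, noise level, and sample size are assumptions made up for the example.

```python
import numpy as np

# Synthetic data: a known line plus normally distributed noise (assumed values)
rng = np.random.default_rng(0)
xs = rng.uniform(15, 45, size=500)
ys = 1.0 + 0.5 * xs + rng.normal(0, 2.0, size=500)

# Closed-form estimates; both statistics use the same n - 1 denominator
slope = np.cov(xs, ys, ddof=1)[0, 1] / np.var(xs, ddof=1)
intercept = ys.mean() - slope * xs.mean()

# Reference fit: np.polyfit returns coefficients from highest degree down
ref_slope, ref_intercept = np.polyfit(xs, ys, 1)

print(np.isclose(slope, ref_slope), np.isclose(intercept, ref_intercept))
```

The two approaches should agree to within floating-point error, provided the covariance and the variance are computed with the same denominator (both `n` or both `n - 1`).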
