Reading: 193 - 200.

### I. Least Squares Applications with Real Data Sets

So far we have used least squares (regression) models to solve toy problems with small artificial data sets. In this lecture (and Lab 5) we will build models using real data sets. Real data is messy: most of these problems involve extra preprocessing steps to get the data into a usable design matrix $X$.

Recall that once we have specified our model (linear, quadratic, exponential, etc.) and built our design matrix, we want to find a solution to

\begin{align} X\beta = y. \end{align}

Unless $y$ is in the column space of $X$ we will not be able to solve this system exactly. So we allow for some error: instead of requiring $y - X\beta = \vec{0}$, we solve $y - X\beta = \vec{\epsilon}$ with the goal of making $||\vec{\epsilon}||$ (equivalently, $||\vec{\epsilon}||^2$) as small as possible.

The least squares solution that does this is 

\begin{align} \hat{\beta} = (X^TX)^{-1}X^Ty \end{align}

and gives the $\hat{\beta}$ that gets $X\hat{\beta}$ as close to $y$ as possible. 
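
As a minimal sketch of this formula, the snippet below builds a small made-up data set, computes $\hat{\beta}$ from the normal equations $X^TX\beta = X^Ty$, and checks the result against NumPy's `np.linalg.lstsq`. The numbers and the names `x_toy`, `X_toy`, `y_toy`, and `beta_hat` are illustrative assumptions and have nothing to do with the data used later in this lecture.

```python
import numpy as np

# Hypothetical toy data: 5 observations of one input and one output.
x_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_toy = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix for the linear model y = beta_0 + beta_1 * x.
X_toy = np.column_stack([np.ones_like(x_toy), x_toy])

# Least squares solution from the normal equations (X^T X) beta = X^T y.
# np.linalg.solve avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)

# Reference solution from NumPy's built-in least squares routine.
beta_ref, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

# The error vector y - X beta_hat is orthogonal to every column of X.
residual = y_toy - X_toy @ beta_hat

print(beta_hat)
print(np.allclose(beta_hat, beta_ref))
```

The orthogonality of `residual` to the columns of `X_toy` is what makes $X\hat{\beta}$ the closest point in the column space to $y$.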

### II. Least Squares Model for the Babies Data Set

Let's build a least squares model to predict a baby's birthweight using the other variables in the dataset. You can (and should) read about ```babies``` at https://www.openintro.org/data/index.php?data=babies.

In [None]:
# Step 1: Import babies dataset
from google.colab import files

uploaded = files.upload()

Saving babies.csv to babies.csv


In [None]:
# Step 2: View dataset

import pandas as pd

babies = pd.read_csv("babies.csv")
babies.head()
#babies.shape

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,115.0,1.0
3,123,,0,36.0,69.0,190.0,0.0
4,108,282.0,0,23.0,67.0,125.0,1.0


#### Data Preprocessing.

Note that there are missing values (```NaN```) in the data. There are several standard methods for dealing with missing data, including

1. Dropping the lines that have missing data.

2. Using the column means to replace the missing data.

3. More advanced methods such as using regression to estimate the missing values.
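
The first two methods can be sketched with pandas as follows. The tiny data frame `toy` and its values are hypothetical and serve only to show what each method does to the rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data frame with one missing value in each column.
toy = pd.DataFrame({
    "gestation": [284.0, 282.0, np.nan, 279.0],
    "age": [27.0, np.nan, 28.0, 36.0],
})

# Method 1: drop every row that contains at least one NaN.
dropped = toy.dropna()

# Method 2: replace each NaN with the mean of its column.
filled = toy.fillna(toy.mean())

print(dropped)
print(filled)
```

Dropping rows shrinks the data set (here from four rows to two), while mean imputation keeps every row at the cost of inserting values that were never observed.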


In [None]:
# Step 3: Preprocess the dataset

babies = babies.fillna(babies.mean()) #fill in the missing value using mean
babies.head()

Unnamed: 0,bwt,gestation,parity,age,height,weight,smoke
0,120,284.0,0,27.0,62.0,100.0,0.0
1,113,282.0,0,33.0,64.0,135.0,0.0
2,128,279.0,0,28.0,64.0,115.0,1.0
3,123,279.338512,0,36.0,69.0,190.0,0.0
4,108,282.0,0,23.0,67.0,125.0,1.0


#### Model Specification and Fit.

Now that the data is imported and cleaned we need to:

1. Specify which linear model we're going to use to fit the data.

2. Fit the least squares model that we chose. 

First we'll try

\begin{align} \hat{bwt} = \hat{\beta_0} + \hat{\beta_1}gestation + \hat{\beta_2}parity + \hat{\beta_3}age + \hat{\beta_4}height + \hat{\beta_5}weight + \hat{\beta_6}smoke. \end{align}


**Exercise**. Interpret $\beta_6$. 

In [None]:
# Step 4: Construct the design matrix and y vector

import numpy as np

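# Columns 1-6 are the predictors: gestation, parity, age, height, weight, smoke.
# No column of ones is needed: LinearRegression fits the intercept by default.
X = babies.to_numpy()[:, 1:7]
# Column 0 is the response, bwt.
y = babies.to_numpy()[:, 0]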
#y = babies.to_numpy()[0:1236,0]
print(X)
print()
print(y)

[[284.   0.  27.  62. 100.   0.]
 [282.   0.  33.  64. 135.   0.]
 [279.   0.  28.  64. 115.   1.]
 ...
 [291.   0.  30.  65. 150.   1.]
 [281.   1.  21.  65. 110.   0.]
 [297.   0.  38.  65. 129.   0.]]

[120. 113. 128. ... 130. 125. 117.]


#### GLMs with the Scikit-Learn library.

We could find the solution ourselves by computing $\hat{\beta} = (X^TX)^{-1}X^Ty$, but for large design matrices it's better to use a library such as ```scikit-learn``` to do the work for us. The low-level code in this library solves the least squares problem with matrix factorizations rather than by explicitly forming $(X^TX)^{-1}$, which tends to be more computationally efficient and numerically stable than evaluating the formula directly.
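
To see that the library and the formula agree, here is a small sketch on synthetic data. The data and the names `X_demo`, `y_demo`, `demo_model`, and `X_ones` are assumptions made up for this illustration. It fits the same model with `LinearRegression` and with the normal equations (after adding a column of ones by hand) and compares the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: 50 observations, 3 predictors, small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = 4.0 + X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Fit with scikit-learn; the intercept is handled internally.
demo_model = LinearRegression().fit(X_demo, y_demo)

# Fit by hand: prepend a column of ones and solve the normal equations.
X_ones = np.column_stack([np.ones(len(X_demo)), X_demo])
beta_hat = np.linalg.solve(X_ones.T @ X_ones, X_ones.T @ y_demo)

# beta_hat[0] is the intercept; beta_hat[1:] are the slopes.
print(np.isclose(beta_hat[0], demo_model.intercept_))
print(np.allclose(beta_hat[1:], demo_model.coef_))
```

On a small, well-conditioned problem like this one the two approaches give the same answer; the library's advantage shows up on large or nearly collinear design matrices.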

In [None]:
# Step 5: Fit the least squares model
from sklearn.linear_model import LinearRegression

babies_model = LinearRegression().fit(X, y)

print(babies_model.coef_)
print()
print(babies_model.intercept_)

[ 0.44192955 -3.11257571  0.01674357  1.12036543  0.05012439 -7.93922874]

-78.60413623799087


**Exercise**. What bwt does the model predict for a firstborn infant with a gestation of 280 days whose mother was 27 years old, 64 inches tall, weighed 120 lbs, and did not smoke during the pregnancy?

In [None]:
# Feature order: gestation, parity, age, height, weight, smoke (0 = did not smoke)
baby = np.array([[280, 0, 27, 64, 120, 0]])

babies_model.predict(baby)

array([123.3065272])

### III. Evaluating Model Fit 

There are an infinite number of models we can use to fit the data. How do we decide which one to go with? There are several common metrics for evaluating how well the model fits the data.

For a given model, the least squares solution minimizes the length of the error vector, $||\vec{\epsilon}||$. So if we are trying to decide between several different models, one way to choose the one that fits the data best is to pick the model with the smallest $||\vec{\epsilon}||$.

Suppose we have $n$ subjects (rows) in our data set. Then the least squares problem can be viewed as 

\begin{align} \min_{\beta_0, \beta_1, \dots, \beta_p} \; \epsilon_1^2 + \epsilon_2^2 + \epsilon_3^2 + \dots + \epsilon_n^2. \end{align}

Our least squares solution $\hat{\beta} = [\hat{\beta_0}, \hat{\beta_1}, ... , \hat{\beta_p}]^T$ gives the smallest sum of squared errors that is possible for the chosen model and data set. 

Once the model has been fit, we can use it to make predictions $\hat{y} = X\hat{\beta}$. That is, $\hat{y}$ is the vector of predictions in $C(X)$ that is closest to the actual data vector $y$. The squared distance between these two vectors is

\begin{align} ||\vec{\epsilon}||^2 = ||\vec{y} - \hat{y}||^2 = \sum_{i=1}^n (y_i - \hat{y_i})^2.\end{align} 

This quantity is known as the <u>sum of squared errors</u> ($SSE$).

**Question**. A "good" model fit for the data is one that has a _____ $SSE$ relative to alternative models. 

Other commonly used measures of model fit are:

1. Mean Squared Error: $MSE = \frac{SSE}{n}$.
2. Residual Standard Error: $RSE = \sqrt{\frac{SSE}{n-p-1}}$. (This is closely related to Lab 1 Exercise 6.) 
3. $R^2$ measures model fit as $\frac{TSS-SSE}{TSS}$, where $TSS=\sum_{i=1}^n(y_i-\bar{y})^2$. Think of $TSS$ as the total variation in the output variable; our model seeks to explain that variation in terms of the $x_i$ inputs. $SSE$ is the amount of variation that the model doesn’t explain.
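
The formulas above can be checked by hand on a few numbers. In the sketch below, the observed values `y_obs`, the predictions `y_pred`, and the predictor count `p` are all made up for illustration; the last two lines compare the hand computations with the corresponding functions in `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical observed values and model predictions for n = 5 subjects.
y_obs = np.array([120.0, 113.0, 128.0, 123.0, 108.0])
y_pred = np.array([118.0, 115.0, 125.0, 124.0, 110.0])

n = len(y_obs)
p = 2  # assumed number of predictors in the hypothetical model

sse = np.sum((y_obs - y_pred) ** 2)        # SSE: sum of squared errors
tss = np.sum((y_obs - y_obs.mean()) ** 2)  # TSS: total variation in y
mse = sse / n                              # MSE = SSE / n
rse = np.sqrt(sse / (n - p - 1))           # RSE = sqrt(SSE / (n - p - 1))
r2 = (tss - sse) / tss                     # R^2 = (TSS - SSE) / TSS

# scikit-learn's metrics agree with the hand computations.
print(np.isclose(mse, metrics.mean_squared_error(y_obs, y_pred)))
print(np.isclose(r2, metrics.r2_score(y_obs, y_pred)))
```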

**Question**: What would $R^2$ be if the model were able to explain all the variation in $y$?

**Exercise**. Find the $MSE$ and $R^2$ for ```babies_model```. 

In [None]:
# Step 6: Evaluate Model Fit

from sklearn import metrics

# Predicted birthweights for the rows used to fit the model
y_hat = babies_model.predict(X)
print(y_hat)
print()
# Mean squared error: SSE / n
mse = metrics.mean_squared_error(y, y_hat)
print(mse)
print()
# R^2: (TSS - SSE) / TSS
r2 = metrics.r2_score(y, y_hat)
print(r2)



**Exercise**. Is mother's ```weight``` useful for predicting a baby's ```bwt```? Answer this question by comparing the $MSE$ and $R^2$ of the full model with those of the reduced model that omits ```weight```:

\begin{align} \hat{bwt} = \hat{\beta_0} + \hat{\beta_1}gestation + \hat{\beta_2}parity + \hat{\beta_3}age + \hat{\beta_4}height + \hat{\beta_5}smoke. \end{align}

In [None]:
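# X's columns are gestation, parity, age, height, weight, smoke (indices 0-5).
# Remove column 4 (weight) to match the reduced model.
X1 = np.delete(X, 4, axis=1)
print(X1)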

babies_model2 = LinearRegression().fit(X1, y)

print(babies_model2.coef_)
print()
print(babies_model2.intercept_)

[[284.   0.  27.  62. 100.]
 [282.   0.  33.  64. 135.]
 [279.   0.  28.  64. 115.]
 ...
 [291.   0.  30.  65. 150.]
 [281.   1.  21.  65. 110.]
 [297.   0.  38.  65. 129.]]
[ 0.45883975 -2.72950291  0.06811411  1.0374483   0.06464495]

-84.51686059284307


In [None]:
from sklearn import metrics

# Predicted birthweights from the reduced model
y_hat = babies_model2.predict(X1)
print(y_hat)
print()
mse = metrics.mean_squared_error(y, y_hat)
print(mse)
print()
# r2_score returns R^2, so store it under a matching name
r2 = metrics.r2_score(y, y_hat)
print(r2)

[118.41899814 122.24747305 119.23748433 ... 128.17981097 117.66308574
 130.12021841]

263.86904785472626

0.20592911316388596
