# Exploratory Data Analysis and Regression
## Introduction to Data Science
### Kigali, Rwanda
### July 8th, 2019


<img src="fig/logos.jpg">

## Outline
1. Linear Regression
2. Multi-linear and Polynomial Regression

# Linear Regression

## What is a Model?

This is a scatter plot of home prices vs square footage of some homes in southern California.

<img src="fig/fig32.jpg" style="height:350px;">

Can you see any patterns or trends?


## What is a Model?

We see that as **square footage** increases, so does **price**. 

<img src="fig/fig32.jpg" style="height:350px;">

But what is a precise, mathematical description of this relationship?

## What is a Model?

Building a model to capture a hypothesized relationship means we predict the value of one group of attributes using another group. 

This prediction problem is called ***regression***. The attribute we are trying to predict (e.g. price) is called the ***outcome*** or the ***target***, denoted by $y$.

The attributes (e.g. square footage) we use to make the prediction are called the ***covariates***, denoted by $x$.

A ***regression model*** is a mathematical function, $f(x)$, that predicts the target. We denote our prediction by $\hat{y} = f(x)$. 

## What is a Model?

We conjectured that the model for this data is a line: $\hat{y} = f(x) = w_1x + w_0$.

<img src="fig/fig33.jpg" style="height:350px;">

But which line fits the data best?
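
To make the question concrete, here is a minimal sketch of the line as a Python function. The function name `predict_price` and both sets of coefficients are hypothetical, chosen only for illustration. Each choice of $w_0, w_1$ gives a different line and therefore a different prediction for the same home.

```python
def predict_price(square_feet, w1, w0):
    """Linear model: predicted price = w1 * square_feet + w0."""
    return w1 * square_feet + w0

# two hypothetical candidate lines (coefficients made up for illustration)
line_a = {"w1": 200.0, "w0": 50_000.0}
line_b = {"w1": 150.0, "w0": 120_000.0}

# predictions of each line for a 1500 square foot home
pred_a = predict_price(1500, **line_a)  # 200 * 1500 + 50000
pred_b = predict_price(1500, **line_b)  # 150 * 1500 + 120000
print(pred_a, pred_b)
```

The two lines disagree on the same input, so we need a way to measure which one is closer to the observed prices.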

## A Notion of Error

An ***absolute residual*** is the absolute difference between the actual price of a home and the price predicted by the line for a given square footage:
$$
|\mathtt{Residual}_i| = |y_i - \hat{y}_i|
$$

<img src="fig/fig34.jpg" style="height:350px;">

## How do we quantify the overall error?

1. **(Max absolute deviation)** Count only the biggest "error"
$$
\max_i |y_i - \hat{y}_i| 
$$
2. **(Sum of absolute deviations)** Add up all the "errors"
$$
\sum_i |y_i - \hat{y}_i| 
$$
3. **(Sum of squared errors)** Add up the squares of the "errors"
$$
\sum_i |y_i - \hat{y}_i|^2 
$$
4. **(Mean squared error)** We can also average the squared "errors".
$$
\frac{1}{N}\sum_{i=1}^N |y_i - \hat{y}_i|^2 
$$

Again, $y_i$ is the observed target, $\hat{y}_i$ is the predicted target.
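
The four error metrics above can be sketched with NumPy as follows. The observed and predicted prices are made-up numbers used only to illustrate the arithmetic.

```python
import numpy as np

# hypothetical observed prices and predictions (in $1000s)
y = np.array([300.0, 450.0, 500.0, 620.0])
y_hat = np.array([310.0, 430.0, 520.0, 600.0])

# absolute residual of each home
abs_residuals = np.abs(y - y_hat)

max_abs_dev = abs_residuals.max()    # 1. max absolute deviation
sum_abs_dev = abs_residuals.sum()    # 2. sum of absolute deviations
sse = (abs_residuals ** 2).sum()     # 3. sum of squared errors
mse = (abs_residuals ** 2).mean()    # 4. mean squared error

print(max_abs_dev, sum_abs_dev, sse, mse)
```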

## Model Fitting

**Question:** What do we mean by choosing the "best" line, $\hat{y} = w_1x + w_0$? 

The ***model fitting*** process:

1. *Choose* an overall error metric. This metric is called the ***loss function***:
$$
\mathcal{L}(w_0, w_1) = \frac{1}{N}\sum_{i=1}^N |y_i - (w_1x_i + w_0)|^2 
$$

2. Set up the problem of finding coefficients or ***parameters***, $w_0, w_1$, such that the loss function is **minimized**:
$$
\min_{w_0, w_1}\mathcal{L}(w_0, w_1) = \min_{w_0, w_1}\frac{1}{N}\sum_{i=1}^N |y_i - (w_1x_i + w_0)|^2 
$$

3. Choose a method of minimizing the loss function.

**Note:** For linear regression, we can minimize $\mathcal{L}$ analytically. We cannot do this for every model!
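
As an illustration of the analytic solution, the sketch below uses the standard closed-form least-squares formulas for a single covariate: $w_1 = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / \sum_i (x_i - \bar{x})^2$ and $w_0 = \bar{y} - w_1\bar{x}$. These formulas are not stated in the slides, and the data are made up so that the true line is $y = 2x + 1$.

```python
import numpy as np

# hypothetical data lying exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

x_mean, y_mean = x.mean(), y.mean()

# closed-form minimizer of the mean squared error for a single covariate
w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
w0 = y_mean - w1 * x_mean

print(w1, w0)
```

Because the data lie exactly on a line, the recovered slope and intercept match the line that generated them.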

## Linear Regression in `sklearn`

```python
# import the LinearRegression model from the sklearn library
from sklearn.linear_model import LinearRegression

# make an instance of the linear regression model
regression = LinearRegression()

# find the coefficients for the line that minimizes mean squared error
regression.fit(x_train, y_train)
```
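
After fitting, the learned parameters and predictions can be inspected. The sketch below is self-contained and uses a small made-up dataset (price in thousands of dollars equal to $0.2 \cdot x + 50$); note that `sklearn` expects the covariates as a 2-D array with one row per home.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical training data: one column (square footage), one row per home
x_train = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
y_train = np.array([250.0, 350.0, 450.0, 550.0])  # 0.2 * x + 50

regression = LinearRegression()
regression.fit(x_train, y_train)

w1 = regression.coef_[0]       # slope of the fitted line
w0 = regression.intercept_     # intercept of the fitted line

# predict the price of a new 1800 square foot home
pred = regression.predict(np.array([[1800.0]]))
print(w1, w0, pred[0])
```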

## Model Evaluation

After fitting the model (finding coefficients that minimize the loss function), we need to **check the error of the model**. Why?
<img src="fig/fig36.jpg" style="height:300px;">

## Model Evaluation on Train and Test

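Rather than computing the mean squared error (which depends on units), we often compute the ***coefficient of determination***, or $R^2$, of our model. 

This number is at most 1 and indicates the fraction of the variation in the target captured by our model. For a least-squares linear model with an intercept it lies between 0 and 1 on the training data; on test data it can be negative when the model predicts worse than simply using the mean of the target.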

<img src="fig/fig37.jpg" style="height:200px;">
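
As a sketch of how $R^2$ is computed, the block below evaluates $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$ by hand and compares it with `sklearn.metrics.r2_score`. The targets and predictions are made-up numbers for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# hypothetical observed targets and model predictions
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

ss_res = np.sum((y - y_hat) ** 2)       # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)    # total variation of the target

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y, y_hat)

print(r2_manual, r2_sklearn)
```

A fitted `LinearRegression` also exposes this metric directly through `regression.score(x, y)`, which can be called on both the training and the testing data.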

## Where Does Test Data Come From?

Typically, data is collected once. New data for testing your model may be expensive or impossible to collect. 

In practice, we split our data into two: one we use for training our models, and one we set aside for **final testing** after a model has been chosen.

```python
# Import function for splitting data into train and test
from sklearn.model_selection import train_test_split

# split our data into training and testing data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=0)
```
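
A small self-contained sketch of the split, using a made-up dataset of 12 rows, shows that the two parts are disjoint and together cover the whole dataset. With `test_size=0.33`, `sklearn` rounds the number of test rows up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical dataset: 12 homes, one covariate each
x = np.arange(12, dtype=float).reshape(-1, 1)
y = np.arange(12, dtype=float)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=0
)

print(x_train.shape, x_test.shape)

# every row ends up in exactly one of the two parts
all_rows = np.sort(np.concatenate([y_train, y_test]))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in the test set on every run.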

## Model Interpretation

In addition to evaluating our model on training and testing data, we must also examine the coefficients themselves. Why?

<img src="fig/fig35.jpg" style="height:300px;">


## Preprocessing: Scaling the Data

If we are fitting a linear regression model 
$$\hat{y} = w_0 + w_1x,$$ 
the units (and hence the scale) of $x$ will affect the magnitude of the coefficients $w_0, w_1$.

To prevent learning nonsensical coefficients, we often scale the data to a fixed range before fitting our model. This is called ***preprocessing***.
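
The effect of units on the coefficients can be sketched as follows. The data are made up (price in thousands of dollars equal to $0.2 \cdot x + 50$), and the same covariate is expressed once in square feet and once in thousands of square feet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data: price (in $1000s) = 0.2 * square_feet + 50
sqft = np.array([[1000.0], [1500.0], [2000.0], [2500.0]])
price = 0.2 * sqft.ravel() + 50.0

# same data, covariate measured in two different units
model_sqft = LinearRegression().fit(sqft, price)             # square feet
model_ksqft = LinearRegression().fit(sqft / 1000.0, price)   # thousands of square feet

print(model_sqft.coef_[0], model_ksqft.coef_[0])
```

Both models make identical predictions, yet the slope differs by a factor of 1000, so the size of a coefficient cannot be read without knowing the scale of its covariate.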

## Data Preprocessing in `sklearn`

```python
# import the model for scaling data
from sklearn.preprocessing import MinMaxScaler

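# make an instance of the scaler that maps each feature to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
# learn each feature's minimum and maximum from the training data only
scaler.fit(x_train)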
# scale training data
x_train = scaler.transform(x_train)
# scale testing data
x_test = scaler.transform(x_test)
```
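
The sketch below, on made-up square footages, shows what the scaler does. Because the minimum and maximum are learned from the training data only, scaled test values can fall outside $[0, 1]$.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical square footages
x_train = np.array([[1000.0], [2000.0], [3000.0]])
x_test = np.array([[1500.0], [3500.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(x_train)  # learns min = 1000 and max = 3000 from the training data

x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

print(x_train_scaled.ravel())
print(x_test_scaled.ravel())
```

Fitting the scaler on the training data alone keeps information about the test set from influencing the model.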

# Multi-Linear Regression

## Linear Regression with Multiple Covariates

It’s unreasonable to expect the price of a home to depend on square footage alone. In reality, price most likely depends on some combination of square footage, $x_1$, number of bedrooms, $x_2$, and the number of bathrooms, $x_3$.

$$
\hat{y} = f_W(x_1, x_2, x_3) = w_0 + w_1x_1 + w_2x_2 + w_3x_3
$$

<img src="fig/fig38.jpg" style="height:300px;">

## Fitting Linear Regression Models with Multiple Covariates

Again, fitting the model means finding coefficients $W = [w_0, w_1, w_2, w_3]$ to minimize mean squared error:

$$
\min_{W}\mathcal{L}(W) = \min_{W}\frac{1}{N}\sum_{i=1}^N |y_i - f_W(x_{i1}, x_{i2}, x_{i3})|^2 
$$

Here $x_{i1}, x_{i2}, x_{i3}$ are the covariates of the $i$-th home.
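
In `sklearn`, fitting with several covariates uses the same `LinearRegression` model; each covariate becomes one column of the input array. The sketch below uses made-up homes and made-up "true" coefficients to generate the prices, so that the fit can be checked against them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical homes: columns are square footage, bedrooms, bathrooms
X = np.array([
    [1000.0, 2.0, 1.0],
    [1500.0, 3.0, 2.0],
    [2000.0, 3.0, 1.0],
    [2500.0, 4.0, 3.0],
    [1800.0, 2.0, 2.0],
])

# made-up coefficients w_1, w_2, w_3 and intercept w_0 used to generate prices
true_w = np.array([0.2, 10.0, 5.0])
true_w0 = 50.0
y = X @ true_w + true_w0

regression = LinearRegression()
regression.fit(X, y)

print(regression.intercept_, regression.coef_)
```

`regression.coef_` holds $[w_1, w_2, w_3]$ in column order, and `regression.intercept_` holds $w_0$.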

# Polynomial Regression

## Polynomial Models for Regression

You might notice that our linear models (univariate and multivariate) don’t seem to fit the housing data very well.

Maybe this is because the underlying relationship between price and square footage, $x$, isn’t linear. Perhaps the model we want is the polynomial:

$$
\hat{y} = f(x) = w_0 + w_1 x + w_2 x^2
$$

Does this mean that we need to define a new model?

## Polynomial Regression as Linear Regression

While the function $f(x) = w_0 + w_1 x + w_2 x^2$ is degree 2 in the input $x$, it is **linear** in the input $[x, x^2]$. That is, if we transform the data into $[x_i, x_i^2]$ for each $x_i$, then we can fit a linear regression model:
$$
g(x, x^2) = w_0 + w_1 x + w_2 x^2
$$
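
A minimal sketch of this idea builds the columns $[x_i, x_i^2]$ by hand with NumPy and fits an ordinary linear regression to them. The data are made up to lie exactly on $y = 1 + 2x + 3x^2$.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# hypothetical data lying exactly on y = 1 + 2x + 3x^2
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x + 3.0 * x ** 2

# transform each x_i into the pair [x_i, x_i^2]
features = np.column_stack([x, x ** 2])

# an ordinary linear regression on the transformed data
g = LinearRegression().fit(features, y)

print(g.intercept_, g.coef_)
```

The linear model recovers $w_0$, $w_1$ and $w_2$ of the quadratic, even though no new kind of model was defined.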

## Fitting a Polynomial Regression Model

1. Transform the training data into polynomial features: $x_i \to [x_i, x_i^2]$
<img src="fig/fig40.jpg" style="height:200px;">
2. Fit a linear regression model:
$$
g(x, x^2) = w_0 + w_1 x + w_2 x^2
$$

## Polynomial Regression in `sklearn`

```python
# import model to transform data into polynomial features
from sklearn.preprocessing import PolynomialFeatures
# import the LinearRegression model
from sklearn.linear_model import LinearRegression

# set the polynomial degree
degree_of_polynomial = 2
# make an instance of the sklearn model for transforming features into polynomial features
# (include_bias=False because LinearRegression already fits the intercept w_0)
polynomial_transform = PolynomialFeatures(degree_of_polynomial, include_bias=False)

# learn the feature transformation from the training data
polynomial_transform.fit(x_train)
# transform x_train to polynomial x_train
x_train_poly = polynomial_transform.transform(x_train)
# transform x_test to polynomial x_test
x_test_poly = polynomial_transform.transform(x_test)

# fit a linear regression model to the polynomial features
regression = LinearRegression()
regression.fit(x_train_poly, y_train)
```
```
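
The following self-contained sketch runs these steps end to end on made-up data lying on $y = 1 + 2x + 3x^2$. It shows the columns that `PolynomialFeatures` produces and that new inputs must pass through the same transformation before prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# hypothetical training and testing data on y = 1 + 2x + 3x^2
x_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 6.0, 17.0, 34.0])
x_test = np.array([[4.0]])

polynomial_transform = PolynomialFeatures(2, include_bias=False)
polynomial_transform.fit(x_train)

# each row becomes [x, x^2]
x_train_poly = polynomial_transform.transform(x_train)
x_test_poly = polynomial_transform.transform(x_test)

regression = LinearRegression()
regression.fit(x_train_poly, y_train)

# predict on the transformed test input, not on the raw x_test
y_test_pred = regression.predict(x_test_poly)
print(x_test_poly, y_test_pred)
```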