In [None]:
import pandas as pd
from pandas.plotting import scatter_matrix
import numpy as np

from glm.glm import GLM
from glm.families import Gaussian, Bernoulli, Poisson, Gamma


import statsmodels.formula.api as smf
from helper_functions import linear_model_summary


%matplotlib inline
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')

# Predictive Linear Regression 

Based on Dan Rupp's Lecture

### The task of predicting quantities: in story form

Let's start a business: **we estimate the fuel economy of cars.** People describe a car and pay us money, and we tell them the fuel economy (in miles per gallon) of their car.

* The first customer comes in, and says, "This is a very competitive field. Why should we pay you?"
* You say, "We are very good at what we do."
* They say, "Your competitor says the same thing. Do you have evidence?"

Note two things.
1. **You are competing against other models.**
2. **Your model will be evaluated quantitatively.**

So you say: "We have analyzed all of our competitors' predictions. All their predictions have some error; the mean of their predictive errors is 5 mpg. We can do better."

* Them: "Ah, interesting. Prove it."
* You: "Describe a car."
* Them: "A 2020 hybrid Ford F250"
* You: "Can't predict that. **That car isn't in our records.**"
* Them: "What good is that?"
* You: "If you ask about an old car, we can predict it perfectly."
* Them: "And...why would we pay for that?"
* You: "I don't know, but we are perfect."
* Them: "We'll go to your competitor instead."


3. **The only performance that matters is predictive performance on unseen examples.**
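
As a rough sketch of what "performance on unseen examples" means in practice, one can hold back some data, fit on the rest, and score only on the held-back part. The data below is synthetic (the weight and mpg numbers are invented for illustration), and mean absolute error is just one possible choice of score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "cars": weight (in 1000 lb) and a noisy mpg; purely illustrative.
weight = rng.uniform(1.5, 5.0, size=200)
mpg = 46.0 - 7.5 * weight + rng.normal(0.0, 2.0, size=200)

# Hold out the last 50 cars; the model never sees them during fitting.
train_w, test_w = weight[:150], weight[150:]
train_mpg, test_mpg = mpg[:150], mpg[150:]

# Fit a straight line on the training cars only.
slope, intercept = np.polyfit(train_w, train_mpg, deg=1)

# Score on the unseen cars with mean absolute error.
test_pred = intercept + slope * test_w
test_mae = np.mean(np.abs(test_mpg - test_pred))
print(f"held-out mean absolute error: {test_mae:.2f} mpg")
```

The number to quote to the customer is the held-out error, not the error on the cars used for fitting.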


### Practical Examples of Linear Regression 

#### Based on the speed of your vehicle, how many feet do you need to move to come to a complete stop?
<img class="right" src="images/speed_stopping_distance.png" style="width:400px;height:300px;">

#### How do advertising dollars spent affect a company's profits?
<img class="left" src="images/Marketing_v_sales.gif" style="width:400px;height:300px;">




What are the X-axis and Y-axis?   

What does the line represent?   

### A linear regression:
 - Is a linear combination of coefficients and attributes.  Each coefficient represents a linear relationship between the feature and the target.
 
 For stopping distance:
 $$
 \hat{y_i} = -10 + (18 \times speed)
 $$
 
 For Advertising:
 $$
 \hat{y_i} = 110 + (2 \times ad\_dollars)
 $$
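
To make the arithmetic concrete, here is a minimal sketch that evaluates the two illustrative equations above. The helper name `predict_linear` and the input values (5 and 100) are hypothetical, chosen only to show the calculation.

```python
def predict_linear(intercept, slope, x):
    """Return intercept + slope * x, the prediction of a one-feature linear model."""
    return intercept + slope * x

# Coefficients copied from the two illustrative equations above;
# the inputs 5 and 100 are made-up example values.
stopping_prediction = predict_linear(-10, 18, 5)      # -10 + 18 * 5
advertising_prediction = predict_linear(110, 2, 100)  # 110 + 2 * 100
print(stopping_prediction, advertising_prediction)
```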

<img src="images/Linear_Regression_coefficients.png"> <!-- style="width:800px;height:400px;"> -->

### Terminology:   

- **Dependent Variable** (dependent, target, $y$) - The value you are trying to predict or investigate.   
- **Independent Variable** (values, features, $X$) - The data you are using to try to predict the target.
    - Single observations of the independent variable/variables are denoted with the lower case $x$
- **Fitting a Model** - For linear regression this is finding the line that best fits the training data.    
- **Training Data** - Observations used to train a model (linear regression).   
- **Prediction**  ( $\hat{y}$  ) - An estimation of the target

# Linear Regression

<p>
Models can be used for many purposes.  For the majority of the course, we focus on using models as predictive tools.  In those cases, getting the correct answer is more important than the information that the model avails to us, and as such, we are able to use models of increasing complexity (at the cost of transparency).  
</p>

<h2>An Initial Relationship.</h2>
<p>
We may intuitively know that smaller cars get better gas mileage. Much less intuitive is the size of the effect, and how much this effect describes the difference in gas mileage from car to car.
</p>

## Loading and Inspecting the Data

For our example we will use a dataset about various specimens of a single species of insect collected across two continents.  The data was acquired from this question on CrossValidated, a statistics/data science question and answer site:

[Multiple regression, full and restricted model](http://stats.stackexchange.com/q/267034/74500)

We chose this dataset because it is small enough to be accessible, but has some interesting features for us to discover!

We will be interested in explaining how the insects' wing span varies, as influenced by the other measurements in our dataset.

We'd like to load the `insects` data into python.  Our first step is to take a quick look at the raw data.

In [None]:
plt.style.use('ggplot')

In [None]:
!head ./data/insects.csv

It looks like there are four columns in our dataset:

```
continent, latitude, wingsize, and sex
```

Each data element is separated from the next by a tab character, so although it has the `.csv` extension, it is not comma separated.

In [None]:
insects = pd.read_csv('./data/insects.csv', sep='\t')
insects.head()

In [None]:
insects.tail()

We've got our four columns `continent, latitude, wingsize, and sex`.  

We can see some short descriptions of their qualities using `info`:

In [None]:
insects.info()

## Looking at the Data

We can get a first feel for how the quantities in our data are spread out using some histograms.

In [None]:
column_names = {
    "continent": "Continent",
    "latitude": "Latitude",
    "wingsize": "Wing Span",
    "sex": "Sex"
}

fig, axs = plt.subplots(2, 2, figsize=(10, 6))
for ax, (column, name) in zip(axs.flatten(), column_names.items()):
    ax.hist(insects[column])
    ax.set_title(name)

fig.tight_layout()

**Discussion:** What have you learned from these histograms?  How do they help you describe the data?

Some observations:

  - `continent` and `sex` take only two values.  There are two continents represented in the data, labeled zero and one, and there are two sexes (probably Male and Female), also labeled zero and one.
  
These zero/one columns are called **binary** or **indicator variables**; they measure a specific yes/no condition.

  - The values of `wingsize` cluster into two distinct groups.  This is very interesting, and worthy of investigation.
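
The observations above can also be checked programmatically. The sketch below uses a small hypothetical frame with the same column names as the insects data (the values are invented) and flags a column as an indicator when every entry is 0 or 1.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the insects data.
toy = pd.DataFrame({
    "continent": [0, 0, 1, 1],
    "latitude": [35.5, 37.0, 40.9, 42.4],
    "wingsize": [901, 896, 797, 806],
    "sex": [0, 0, 1, 1],
})

# A column is an indicator if every value is 0 or 1.
is_indicator = toy.isin([0, 1]).all()
indicator_columns = is_indicator[is_indicator].index.tolist()
print(indicator_columns)

# Group means show how the target differs between the two levels.
mean_by_sex = toy.groupby("sex")["wingsize"].mean()
print(mean_by_sex)
```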

## Scatterplots

Histograms are useful, but limited, as they do not reveal anything about the *relationships between the columns in our data*.  To rectify this we turn to undoubtedly the most effective and flexible visualization, the **scatterplot**.

A good first step is to use `pandas.plotting.scatter_matrix` to get a global view of our data.

In [None]:
scatter_matrix(insects, figsize=(8, 8), s=100)
plt.show()

The `latitude` vs. `wingsize` scatterplot looks interesting, so let's take a close look at that.

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))

ax.scatter(insects.latitude, insects.wingsize, s=40)
ax.set_xlabel("Latitude")
ax.set_ylabel("Wing Size")
ax.set_title("Insect Wing Sizes at Various Latitudes")

**Discussion:** What patterns do you see in the scatterplot?  Can you form any hypotheses about the data?

Here are some thoughts:
    
  - The most prominent feature of this data is the two bands.  There seem to be two very well defined elongated clusters of data, with the average wingsize in one cluster much greater than in the other.
  - Within each cluster there is a noticeable tendency for wingsize to first decrease, and then increase, as latitude varies.

## Linear Regression

This leads to a few questions we may wish to answer with the data.

  1. Are the two clusters associated with one of the other two variables in the dataset, `continent` or `sex`?
  2. Can we somehow summarize the way that `wingsize` varies with `latitude`?
  
Let's answer each of these questions.

### Are The Two Clusters Associated With Either Continent or Sex?

We can discover if the two clusters in the data are associated with either `continent` or `sex` through a well chosen visualization.  Let's make the same scatterplot from before, but color each point either red or blue, according to the value of `continent` or `sex`.

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))

continent_boolean = insects.continent.astype(bool)
ax.scatter(insects.latitude[continent_boolean], 
           insects.wingsize[continent_boolean], 
           s=40, c="red", label="Continent 1")
ax.scatter(insects.latitude[~continent_boolean], 
           insects.wingsize[~continent_boolean],
           s=40, c="blue", label="Continent 0")
ax.set_xlabel("Latitude")
ax.set_ylabel("Wing Size")
ax.set_title("Are The Two Clusters Associated With Continent?")
ax.legend()

The values of continent seem scattered randomly across the two clusters, so it does **not** seem like continent is associated with the clusters.

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))

def make_insects_scatter_plot(ax):
    sex_boolean = insects.sex.astype(bool)
    ax.scatter(insects.latitude[sex_boolean], 
               insects.wingsize[sex_boolean],
               s=40, c="red", label="Male")
    ax.scatter(insects.latitude[~sex_boolean], 
               insects.wingsize[~sex_boolean],
               s=40, c="blue", label="Female")
    ax.set_xlabel("Latitude")
    ax.set_ylabel("Wing Size")
    ax.set_title("Insect Wing Sizes at Various Latitudes")
    ax.legend()
    
make_insects_scatter_plot(ax)

Consider the visual from our last lecture on Cross Validation. Notice we always come back to business understanding and data understanding. If we build a model without basic EDA and statistical investigation first, we will likely miss patterns like this.

<img src="images/800px-CRISP-DM_Process_Diagram.png" alt="Drawing" style="width: 400px;"/>

There we go!

This is pretty definitive: the insects in the larger cluster are all female, and the insects in the smaller cluster are all male.  This seems like enough evidence to conclude that the sex of the insect accounts for the data clustering into two groups.

**Note:** how little technology we needed to make this point convincingly.  It is *very important* to explore your data and use it to *ask and then answer* questions like this.  Many data scientists reach immediately for their most powerful tools, which often leaves them with little to say when asked simple questions.

### Is An Increasing Latitude Associated With an Increasing Wing Size?

This question is a little more sophisticated, and we need some new technology to answer it.

The idea is to create an equation:

$$ \text{Wing Span} \approx a + b \times \text{Latitude} $$

Then we can look at the number $b$, which tells us how we should expect `wingsize` to change as `latitude` changes.  If we find that $b > 0$, that's evidence that an increasing latitude is associated with an increasing wing span.

Linear regression is the tool that estimates $a$ and $b$ from the data.

In [None]:
import statsmodels.api as sm


Please note that we are adding the constant here so we get an intercept term.

In [None]:
X = sm.add_constant(insects['latitude'])
y = insects['wingsize']

In [None]:
X.head()

In [None]:
model = sm.OLS(y, X)
results = model.fit()
results.summary()

The linear regression has attempted to estimate the equation we are after, and it has returned:

$$ \text{Wing Span} \approx 765.20 + 2.54 \times \text{Latitude} $$

So, on average, we can expect the wing span to increase by $2.54$ for every one-unit increase in latitude.

The numbers estimated in linear regression are called **parameter estimates** or **coefficient estimates** and are usually denoted with the Greek letter $\beta$:

$$y \approx \beta_0 + \beta_1 x $$

The parameter estimate with no associated variable is usually called the **intercept**:

$$ \text{Wing Span} \approx \underbrace{765.20}_{\text{Intercept}} + \underbrace{2.54}_{\text{Parameter Estimate}} \times \text{Latitude} $$
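
For a single predictor, the least-squares estimates have a well-known closed form: $\beta_1$ is the sum of cross-deviations of $x$ and $y$ divided by the sum of squared deviations of $x$, and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. The sketch below checks that formula against `np.polyfit` on synthetic data (not the insects measurements; the true intercept and slope are invented).

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data, not the insects measurements.
x = rng.uniform(30, 60, size=100)
y = 800.0 + 2.5 * x + rng.normal(0.0, 20.0, size=100)

# Closed-form least-squares estimates for one predictor.
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

# np.polyfit solves the same least-squares problem numerically.
slope, intercept = np.polyfit(x, y, deg=1)
print(beta_0, beta_1)
```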


#### As the Equation of a Line

One way we can visualize this is to look at the regression as returning to us the equation for a line.  This line is the **best summary of the data** (under the assumption that a line is a reasonable way to summarize the data).

In [None]:
# This is from an old library for linear regression.  Do not worry about it.
linear_model = GLM(family=Gaussian())
linear_model.fit(insects, formula='wingsize ~ latitude')


fig, ax = plt.subplots(figsize=(8, 5))

make_insects_scatter_plot(ax)

# Make a line graph of the predictions.
def make_insects_model_line(ax, label="Linear Regression"):
    x = np.linspace(30, 60, num=250)
    ax.plot(x, linear_model.coef_[0] + linear_model.coef_[1] * x,
           linewidth=2, c="black", label=label)
    ax.set_xlim(30, 60)

make_insects_model_line(ax)

**Discussion:** Does this model have any issues?  If so, what are they?

This plot shows two serious flaws in our model:

  - It has no knowledge of the sex of the insect, so the fit line attempts to bisect the two clusters of data.
  - It cannot account for the curvature in the data points.  The model attempts to fit a line to data that does not have a linear shape.

## Accounting for the Sex of the Insect: Binary Predictors

It would be much better to take account of the sex of the insect and fit two lines: one line predicting the wing size given the latitude for males, and another for females.

The easiest way to do this is to modify our equation:

$$ \text{Wing Span} \approx a + b \times \text{Latitude} + c \times \text{Sex} $$

There is now another term: if the insect is male we *add $c$* to the prediction, otherwise we add nothing.
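
A minimal sketch of this idea on synthetic data: with an indicator column in the design matrix, least squares fits two parallel lines whose intercepts differ by exactly $c$. The true values used to generate the data (950, -0.5, -90) are invented for illustration, and plain `np.linalg.lstsq` stands in for the regression libraries used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Synthetic data with a known offset of -90 for the group coded 1.
latitude = rng.uniform(30, 60, size=n)
sex = rng.integers(0, 2, size=n)
wing = 950.0 - 0.5 * latitude - 90.0 * sex + rng.normal(0.0, 5.0, size=n)

# Design matrix: a column of ones (intercept), latitude, and the indicator.
X = np.column_stack([np.ones(n), latitude, sex])
(a, b, c), *_ = np.linalg.lstsq(X, wing, rcond=None)

# The two fitted lines share the slope b and differ only in their intercepts.
intercept_sex0 = a
intercept_sex1 = a + c
print(f"slope={b:.2f}, intercepts: {intercept_sex0:.1f} and {intercept_sex1:.1f}")
```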


In [None]:
X = sm.add_constant(insects[['latitude','sex']])
y = insects['wingsize']

model = sm.OLS(y, X)
results = model.fit()
results.summary()

A couple of points are important:

- We now have an estimate of $-88.03$ for the number $c$.  This means that, on average, being male costs an insect about $88.03$ in wing span.
- The parameter estimates for the `Intercept` and for `latitude` **have changed**.  This is a very common situation.  When we fit a model with multiple variables, the model accounts for both how the variables are related to $y$, **and** how they are related to **each other**.
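
The second point can be illustrated with a small synthetic experiment (all numbers below are invented). When two predictors are correlated, leaving one out makes the other absorb part of its effect; adding it back moves the coefficient.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# Two correlated predictors: x2 tends to be large when x1 is large.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(scale=0.5, size=n)

def ols(columns, target):
    """Least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(target)), *columns])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs

coef_x1_only = ols([x1], y)   # x1 also absorbs part of x2's effect
coef_both = ols([x1, x2], y)  # x1's coefficient moves back toward 2.0
print(coef_x1_only[1], coef_both[1])
```

In this setup the one-variable slope lands near $2.0 + 3.0 \times 0.8 = 4.4$, while the two-variable fit recovers a slope near $2.0$.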

The predictions from this model now depend on whether an insect is male or female, meaning that we can draw one line for males, and one line for females.

$$
\text{Wing Span} \approx 948.25 + -0.41 \times \text{Latitude} + -88.03 \times \text{Sex}
$$
<br/>
<br/>
<br/>
The calculation for female:
$$
\text{Wing Span} \approx 948.25 + -0.41 \times \text{Latitude} + -88.03 \times 0
$$
or:
$$
\text{Wing Span} \approx 948.25 + -0.41 \times \text{Latitude}
$$
<br/>
<br/>
<br/>
<br/>
For male:
$$
\text{Wing Span} \approx 948.25 + -0.41 \times \text{Latitude} + -88.03 \times 1
$$
or:
$$
\text{Wing Span} \approx 948.25 + -0.41 \times \text{Latitude}+ -88.03 
$$

## This is important when thinking about inferential regression
- The better the model describes the data, the more trustworthy its coefficient estimates are as summaries of each variable's effect.

In [None]:

insects_model_with_sex = GLM(family=Gaussian())
insects_model_with_sex.fit(insects, formula='wingsize ~ latitude + sex')

fig, ax = plt.subplots(figsize=(8, 5))

make_insects_scatter_plot(ax)

def make_insects_model_lines(ax):
    x = np.linspace(30, 60, num=250)
    ax.plot(x, insects_model_with_sex.coef_[0] 
                 + insects_model_with_sex.coef_[1] * x,
           linewidth=2, c="blue")
    ax.plot(x, insects_model_with_sex.coef_[0] 
                 + insects_model_with_sex.coef_[1] * x + insects_model_with_sex.coef_[2],
           linewidth=2, c="red")
    ax.set_xlim(30, 60)
    ax.set_xlabel("Latitude")
    ax.set_ylabel("Wing Size")
    ax.set_title("Insect Wing Sizes at Various Latitudes")
    ax.legend()
    
make_insects_model_lines(ax)

The model is now fitting much better to the data, but the curvature of the scatterplots is still an issue.

## Accounting for the Curvature of the Data Points: Transformations

We can account for the curvature of the data points by using a *polynomial regression*.  This means that we fit powers of latitude bigger than one:

$$ \text{Wing Span} \approx a + b \times \text{Latitude} + c \times \text{Latitude}^2 +  d \times \text{Sex} $$
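
Note that a polynomial regression is still linear in its coefficients: the squared term is just one more column in the design matrix. A minimal sketch on synthetic U-shaped data (the curve and noise level are invented, and the sex term is left out to keep it short):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

# Synthetic U-shaped data; the true curve bottoms out near latitude 45.
latitude = rng.uniform(30, 60, size=n)
wing = 900.0 + 0.3 * (latitude - 45.0) ** 2 + rng.normal(0.0, 5.0, size=n)

# A straight-line fit versus a fit that also includes latitude squared.
X_line = np.column_stack([np.ones(n), latitude])
X_quad = np.column_stack([np.ones(n), latitude, latitude ** 2])

coef_line, *_ = np.linalg.lstsq(X_line, wing, rcond=None)
coef_quad, *_ = np.linalg.lstsq(X_quad, wing, rcond=None)

# Compare mean squared residuals of the two fits.
mse_line = np.mean((wing - X_line @ coef_line) ** 2)
mse_quad = np.mean((wing - X_quad @ coef_quad) ** 2)
print(f"line MSE: {mse_line:.1f}, quadratic MSE: {mse_quad:.1f}")
```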

In [None]:
X = insects[['latitude','sex']].copy()  # copy, so the new column is not assigned to a slice of `insects`
X['latitude**2'] = X['latitude']**2
X = sm.add_constant(X)

y = insects['wingsize']

X.head()

In [None]:
model = sm.OLS(y, X)
results = model.fit()
results.summary()

In [None]:
insects_model_quad = GLM(family=Gaussian())
# Fit the new model object (not the earlier `linear_model`).
insects_model_quad.fit(
    insects, 
    formula='wingsize ~ latitude + I(latitude**2) + sex')

fig, ax = plt.subplots(figsize=(8, 5))

# Make a scatterplot of the data.
make_insects_scatter_plot(ax)

def make_insects_model_quadratic(ax):
    x = np.linspace(30, 60, num=250)
    ax.plot(x, insects_model_quad.coef_[0] 
                 + insects_model_quad.coef_[1] * x
                 + insects_model_quad.coef_[2] * x*x,
           linewidth=2, c="blue")
    ax.plot(x, insects_model_quad.coef_[0] 
                 + insects_model_quad.coef_[1] * x
                 + insects_model_quad.coef_[2] * x*x
                 + insects_model_quad.coef_[3],
           linewidth=2, c="red")
    ax.set_xlim(30, 60)
    ax.set_xlabel("Latitude")
    ax.set_ylabel("Wing Size")
    ax.set_title("Insect Wing Sizes at Various Latitudes")
    ax.legend()
    
make_insects_model_quadratic(ax)

Now we have a good fit to our data.

**Discussion:** Should we go further and add higher degree terms into the model?  Why or why not?  What would happen if we do?

## How will adding continent change the model?

Now we have four different possible combinations of indicator variables:

  - `sex == 0 and continent == 0`
  - `sex == 1 and continent == 0`
  - `sex == 0 and continent == 1`
  - `sex == 1 and continent == 1`
  
Which results in four curves being fit (though the **shape** of the quadratic trend is the same for each, as the parameters associated with latitude are **shared**).

In [None]:
insects.head()

In [None]:
insects_model_quad_with_continent = GLM(family=Gaussian())
insects_model_quad_with_continent.fit(
    insects,
    formula='wingsize ~ latitude + I(latitude**2) + sex + continent')
insects_model_quad_with_continent.summary()

In [None]:
fig, ax = plt.subplots(figsize=(8, 5))

# Make a scatterplot of the data.
make_insects_scatter_plot(ax)

def make_insects_model_quadratic_and_continent(ax):
    x = np.linspace(30, 60, num=250)
    # Every curve uses the intercept of the model that includes continent.
    ax.plot(x, insects_model_quad_with_continent.coef_[0] 
                 + insects_model_quad_with_continent.coef_[1] * x
                 + insects_model_quad_with_continent.coef_[2] * x*x,
           linewidth=2, c="blue")
    ax.plot(x, insects_model_quad_with_continent.coef_[0] 
                 + insects_model_quad_with_continent.coef_[1] * x
                 + insects_model_quad_with_continent.coef_[2] * x*x
                 + insects_model_quad_with_continent.coef_[4],
           linewidth=2, c="blue", linestyle="--")
    ax.plot(x, insects_model_quad_with_continent.coef_[0] 
                 + insects_model_quad_with_continent.coef_[1] * x
                 + insects_model_quad_with_continent.coef_[2] * x*x
                 + insects_model_quad_with_continent.coef_[3],
           linewidth=2, c="red")
    ax.plot(x, insects_model_quad_with_continent.coef_[0] 
                 + insects_model_quad_with_continent.coef_[1] * x
                 + insects_model_quad_with_continent.coef_[2] * x*x
                 + insects_model_quad_with_continent.coef_[3]
                 + insects_model_quad_with_continent.coef_[4],
           linewidth=2, c="red", linestyle="--")
    ax.set_xlim(30, 60)
    ax.set_xlabel("Latitude")
    ax.set_ylabel("Wing Size")
    ax.set_title("Insect Wing Sizes at Various Latitudes")
    ax.legend()
    
make_insects_model_quadratic_and_continent(ax)

The effect of the `continent` variable is seen here as the difference between the solid and dashed lines.  It's evident from this plot that `continent` is not very useful in describing the `wingsize` of our insects.

## How about categorical fields that are not binary?
 - You have a survey with feedback options:
   * 1) Poor
   * 2) Fair
   * 3) Good
   * 4) Excellent    
<br/>
  <br/>      
 - You have a field with eye color:
   * 1) Brown
   * 2) Green 
   * 3) Blue
   * 4) Grey
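
The two examples above differ in one key respect: the survey ratings have a natural order, while eye colors do not. One common way to handle each case is sketched below on hypothetical responses. Treating ordered levels as integer codes assumes the levels are evenly spaced, which is a modeling choice rather than a fact about the data.

```python
import pandas as pd

# Hypothetical survey responses and eye colors.
survey = pd.Series(["Good", "Poor", "Excellent", "Fair", "Good"])
eyes = pd.Series(["Brown", "Blue", "Grey", "Green", "Brown"])

# Ordered categories can be mapped to integer codes that respect the order.
rating_order = ["Poor", "Fair", "Good", "Excellent"]
survey_codes = pd.Categorical(survey, categories=rating_order, ordered=True).codes
print(list(survey_codes))

# Unordered categories get one indicator column per level instead.
eye_indicators = pd.get_dummies(eyes, prefix="eye")
print(eye_indicators.columns.tolist())
```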
   

### How may we handle the `origin` field in the cars dataset?

In [None]:
cars = pd.read_csv('data/cars_multivariate.csv', na_values=['?'])
cars = cars[cars.horsepower.notnull()]
cars.head()

We can create binary indicators for whether a car is of a given origin:

In [None]:
pd.get_dummies(cars['origin'], prefix='origin')

Do I need all three indicator columns?

In [None]:
pd.get_dummies(cars['origin'], prefix='origin', drop_first=True)
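
One way to see why a level can be dropped: when the model also has an intercept, the full set of indicator columns always sums to the intercept column, so the design matrix loses a rank. The sketch below checks this with `np.linalg.matrix_rank` on hypothetical origin codes (the real column lives in the cars data).

```python
import numpy as np
import pandas as pd

# Hypothetical origin codes; the real column lives in the cars data.
origin = pd.Series([1, 2, 3, 1, 2, 3, 1])

all_levels = pd.get_dummies(origin, prefix="origin").astype(float)
dropped = pd.get_dummies(origin, prefix="origin", drop_first=True).astype(float)

# Add an intercept column of ones to each design matrix.
X_all = np.column_stack([np.ones(len(origin)), all_levels.to_numpy()])
X_dropped = np.column_stack([np.ones(len(origin)), dropped.to_numpy()])

# With every level present, the indicators sum to the intercept column,
# so the matrix has 4 columns but only rank 3.
rank_all = np.linalg.matrix_rank(X_all)
rank_dropped = np.linalg.matrix_rank(X_dropped)
print(X_all.shape[1], rank_all)
print(X_dropped.shape[1], rank_dropped)
```

The dropped level becomes the baseline: its effect is carried by the intercept, and each remaining indicator measures a difference from that baseline.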

## Important note. 
Many times it is useful to look at our 'residuals' after making a model to see what they look like.
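
A residual is the observed value minus the prediction, $y - \hat{y}$. As a small sketch of what to look for, the code below fits a straight line to synthetic curved data (invented for illustration). The residuals average to zero, as they always do for a least-squares fit with an intercept, yet they show a clear pattern.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic curved relationship fit with a straight line.
x = np.linspace(0.0, 10.0, 200)
y = 1.0 + 0.5 * x ** 2 + rng.normal(0.0, 1.0, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (intercept + slope * x)

# With an intercept in the model, least-squares residuals average to zero...
print(f"mean residual: {residuals.mean():.2e}")

# ...but here they are positive at both ends and negative in the middle,
# which is the signature of curvature the line has missed.
ends = np.r_[residuals[:20], residuals[-20:]].mean()
middle = residuals[90:110].mean()
print(f"ends: {ends:.2f}, middle: {middle:.2f}")
```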

In [None]:
# Create a linear regression object
X = cars['weight']
X = sm.add_constant(X)
y = cars['mpg']

regressor = sm.OLS(y,X)
regressor = regressor.fit()
regressor.summary()

In [None]:
# Plot the line along with the data
slope = -.0076
intercept = 46.2165
ax = cars.plot('weight','mpg',kind='scatter')
xx = np.linspace(1000, 5500, 100)
ax.plot(xx, xx*slope + intercept, color='red', lw=3)
_ = ax.set_xlim([1000,5500])

### Many times when modeling, the residual plots will be informative

In [None]:
X["y_hat"] = regressor.predict(X)
X["Residuals"] = y - X["y_hat"]
# Plot the residuals against weight, with a reference line at zero
ax = X.plot('weight','Residuals',kind='scatter')
ax.plot(xx, [0]*100, color='red', lw=3)
_ = ax.set_xlim([1500,5500])

In [None]:
X_sq = pd.DataFrame({'weight' : cars['weight'], 'weight_sq' : cars['weight']**2})
X_sq = sm.add_constant(X_sq)

reg_sq = sm.OLS(y, X_sq).fit()
reg_sq.summary()

In [None]:
# Plot the line along with the data
slope = -.0185
slope_sq = 1.697e-6
intercept = 62.255


ax = cars.plot('weight','mpg',kind='scatter')
xx = np.linspace(1000, 5500, 100)
ax.plot(xx, xx*xx*slope_sq + xx*slope + intercept, color='red', lw=3)
_ = ax.set_xlim([1000,5500])

In [None]:
X_sq["y_hat"] = reg_sq.predict(X_sq)
X_sq["Residuals"] = y - X_sq["y_hat"]


# Plot the residuals of the quadratic fit against weight, with a reference line at zero
ax = X_sq.plot('weight','Residuals',kind='scatter')
ax.plot(xx, [0]*100, color='red', lw=3)
_ = ax.set_xlim([1500,5500])

In [None]:
y.shape

In [None]:
X_sq['Residuals'].shape

In [None]:
X_sq['y'] = y

# Plot the residuals against the fitted values, with a reference line at zero
ax = X_sq.plot('y_hat','Residuals',kind='scatter')
ax.hlines(0,10,45)


In [None]:
# Plot the residuals against the observed values, with a reference line at zero
ax = X_sq.plot('y','Residuals',kind='scatter')
ax.hlines(0,10,45)
