### Linear Regression

Other Sources:
- Here is a <a href="https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-guide-regression-analysis-plot-interpretations/tutorial/">nice introductory article</a>.  It gives some additional attention to examining and evaluating the regression model. The code is written in R rather than Python, but it shouldn't be hard to interpret the few lines of R code. 
- Text section 18.6 covers the material theoretically and quickly.  Don't worry about the material on gradient descent.
- Chapter 3 of <a href="https://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf">An Introduction to Statistical Learning</a> is your best bet for a deeper treatment, even though R is the language of choice.  In fact, that should be your go-to book for more machine learning material.

Objectives:
- Define data modeling and simple linear regression
- Show various steps in examining a data set and preparing it for input to a learning library
- Build a linear regression model using Python libraries
- Understand how to evaluate the quality of a model and compare it to alternative models

##### The Bikeshare Data Set

We'll be working with a data set from Capital Bikeshare that was used in a [Kaggle competition](https://www.kaggle.com/c/bike-sharing-demand/data), but heavily modified for this class!

The goal is to predict total ridership of Capital Bikeshare in any given hour.

### Capital Bikeshare Data Dictionary

| Variable| Description |
|---------|----------------|
|rental_hour| hourly date + timestamp  |
|season|  'winter', 'spring', 'summer', 'fall' |
|workingday| whether the day is neither a weekend nor holiday|
|weather| 1 -> Clear or partly cloudy;  2 -> Clouds and mist; 3 -> Light rain or snow;  4 -> Heavy rain or snow|
|temp| temperature in Celsius|
|humidity| relative humidity|
|windspeed| wind speed|
|total_rentals| number of total rentals|
|sunspots| sunspot activity (0-2, none to high)|


In [14]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['font.size'] = 14

In [15]:
bikes = pd.read_csv('bikeshare.csv', parse_dates=['rental_hour'])

In [16]:
bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 9 columns):
rental_hour      10886 non-null datetime64[ns]
season           10886 non-null object
workingday       10886 non-null int64
weather          10886 non-null int64
temp             10886 non-null float64
humidity         10886 non-null int64
windspeed        10886 non-null float64
total_rentals    10886 non-null int64
sunspots         10886 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(5), object(1)
memory usage: 765.5+ KB


In [17]:
bikes.head(5)

| | rental_hour | season | workingday | weather | temp | humidity | windspeed | total_rentals | sunspots |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | winter | 0 | 1 | 9.84 | 81 | 0.0 | 16 | 0 |
| 1 | 2011-01-01 01:00:00 | winter | 0 | 1 | 9.02 | 80 | 0.0 | 40 | 1 |
| 2 | 2011-01-01 02:00:00 | winter | 0 | 1 | 9.02 | 80 | 0.0 | 32 | 1 |
| 3 | 2011-01-01 03:00:00 | winter | 0 | 1 | 9.84 | 75 | 0.0 | 13 | 0 |
| 4 | 2011-01-01 04:00:00 | winter | 0 | 1 | 9.84 | 75 | 0.0 | 1 | 1 |


### Linear Regression Basics
---

### Form of Linear Regression

Think of our data frame as a set of observations of the form:  
$$(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i)$$ 

where the $x$ are the "independent variables" and $y$ is the "dependent variable".  We are trying to find a function of the $x$ variables that predicts the value of the $y$ variable.   We have $n$ observations and $k$ independent variables (features).

*  *Regression*: the $y$ variable is numeric and ordered -- for example temperature or number of rentals
*  *Classification*: the $y$ variable is taken from a set -- for example true/false or {red, green, blue}

*Linear* -- Start with the formula for a line:  $$y = \alpha + \beta x\\y = \beta_0 + \beta_1 x$$ 
(The second notation gives a consistent name to the coefficients.)

* First statement of the linear regression problem:  find the values of $\beta_0$ and $\beta_1$ that best predict the $y$ values for all of our observations
* Second statement of the problem:  since every model contains some amount of noise -- "random irreducible error" $\epsilon$ -- we are actually fitting $$y = \beta_0 + \beta_1 x + \epsilon$$
where $\epsilon$ is small and uncorrelated with $x$.

---

Here, we will generalize this to $k$ independent variables as follows:

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k + \epsilon\quad =\quad \beta_0 + \left(\sum_{i=1}^{k} \beta_ix_i\right) + \epsilon$$

- $y$ is the response.
- $\beta_0$ is the intercept.
- $\beta_1$ is the coefficient for $x_1$ (the first feature).
- $\beta_k$ is the coefficient for $x_k$ (the kth feature).
- $\epsilon$ is the _error_ term, which is assumed to be independent of the $x$ variables.


A hypothetical example of this applied to our data might be:

$${\tt total\_rentals} = 20 - 2 \cdot {\tt temp} - 3 \cdot {\tt windspeed}\ +\ ...\ +\ 0.1 \cdot {\tt sunspots}$$

This equation is still called **linear** because the highest degree of the independent variables (e.g. $x_i$) is 1. Note that because the $\beta$ values are constants, they will not be independent variables in the final model, as seen above.

<span style="color:blue; font-size:120%"> What are the limits of this linearity assumption?  What kind of relationships can't we capture?  Can you think of real examples where linearity is violated?</span>

---

In the regression equation
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_kx_k + \epsilon$

the $\beta$ values are called the **model coefficients**:

- These values are estimated (or "learned") during the model fitting process using the **least squares criterion**.
- Specifically, we are trying to find the line (mathematically) that minimizes the **sum of squared residuals** (or "sum of squared errors").
- Once we've learned these coefficients, we can use the model to predict the response.
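As a rough illustration of the least squares criterion, the sketch below fits a line to a small synthetic data set by solving the least squares problem directly with NumPy, then checks that scikit-learn arrives at the same coefficients. The data and the "true" values 3 and 2 are made up for illustration; they are not taken from the bikeshare file.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 3 + 2x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=200)

# Design matrix with a column of ones so that beta[0] is the intercept
A = np.column_stack([np.ones_like(x), x])

# Least squares: choose beta to minimize the sum of squared residuals
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
ssr = np.sum((y - A @ beta) ** 2)

# Any other choice of coefficients gives a larger sum of squared residuals
nudged = beta + np.array([0.5, -0.1])
ssr_nudged = np.sum((y - A @ nudged) ** 2)

# scikit-learn's LinearRegression solves the same problem
lr = LinearRegression().fit(x.reshape(-1, 1), y)
print(beta, lr.intercept_, lr.coef_)
print(ssr, ssr_nudged)
```

The estimated coefficients should land near, but not exactly on, 3 and 2 because of the noise term.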

![Estimating coefficients](estimating_coefficients.png)

We measure how well the model fits the data in terms of mean-squared error as follows:
$$MSE = \frac{1}{n} \| \hat{y}(\mathbf{X}) - \vec{y} \|^2 = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

The linear regression algorithm finds the values for $\beta_0, \beta_1, \beta_2, ..., \beta_k$ that minimize MSE *for a given data set (the training set)*.

In the diagram above:

- The black dots are the **observed values** of x and y.
- The blue line is our **least squares line**.
- The red lines are the **residuals**, which are the vertical distances between the observed values and the least squares line.

If there are two predictors, the model fits a plane, and residuals are the distance between the observed value and the least squares plane.

![Regression with Two Variates](multiple_regression_plane.png)

If there are more than two predictors, it's hard to visualize.
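The residuals and mean squared error described above can be computed by hand in a couple of lines. This is a minimal sketch on four made-up observations and predictions (not bikeshare values), compared against scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical observed values and model predictions
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

# Residuals: vertical distances between observations and predictions
residuals = y_true - y_pred

# MSE: the average of the squared residuals
mse_by_hand = np.mean(residuals ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

print(residuals)
print(mse_by_hand, mse_sklearn)
```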

### Running a Regression using scikit-learn

To use scikit-learn, we have to make our observation data fit the format it expects:

1. Features and response should be separate objects
2. Features and response should be entirely numeric
3. Features and response should be NumPy arrays (or easily converted to NumPy arrays)
4. Features and response should have specific shapes (outlined below)
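The shape requirements in the list above can be sketched on a tiny made-up frame (the three rows here are placeholders, not the real data set): the features form a two-dimensional object with one column per feature, and the response is one-dimensional.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the real data
df = pd.DataFrame({'temp': [9.8, 9.0, 14.2],
                   'total_rentals': [16, 40, 32]})

X = df[['temp']]          # list of columns -> DataFrame, shape (n, 1)
y = df['total_rentals']   # single column   -> Series, shape (n,)
print(X.shape, y.shape)

# A plain 1-D NumPy array has to be reshaped into a single-column 2-D array
temps = np.array([9.8, 9.0, 14.2])
X_from_array = temps.reshape(-1, 1)
print(X_from_array.shape)
```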

<a id="building-a-linear-regression-model-in-sklearn"></a>
### Building a (Single) Linear Regression Model in sklearn

#### Create a feature matrix called X that holds a `DataFrame` with only the temp variable and a `Series` called y that has the "total_rentals" column.

In [47]:
# Create X and y: try to predict total rentals from temperature (only)
feature_cols = ['temp']
X1 = bikes[feature_cols]
y1 = bikes.total_rentals

<span style="color:blue; font-size:120%">Why do you think X is capitalized but y is not?</span>

In [48]:
print(type(X1))
print(type(y1))
print(X1.shape)
print(y1.shape)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>
(10886, 1)
(10886,)



### scikit-learn's Four-Step Modeling Pattern -- Same for Many Prediction Algorithms

**Step 1:** Import the class you plan to use.

In [20]:
from sklearn.linear_model import LinearRegression

**Step 2:** Instantiate the model (make an instance of the class).

In [50]:
# Make an instance of a LinearRegression object.
lr1 = LinearRegression()
type(lr1)

sklearn.linear_model.base.LinearRegression

- Created an object that "knows" how to do linear regression, and is just waiting for data.

**Step 3:** Fit the model with data (aka "model training").

- Model is "learning" the relationship between X and y in our "training data."
- Process through which learning occurs varies by model.
- Occurs in-place.

In [51]:
lr1.fit(X1, y1)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

- Once a model has been fit with data, it's called a "fitted model."
- Next steps
  - We can predict y values from an X vector
  - We need to assess how good the model is as a predictor

In [52]:
print(lr1.intercept_)
print(lr1.coef_)

6.04621295961681
[9.17054048]


These values correspond to the fitted equation

${\tt total\_rentals} = 6.05 + 9.17 \times{\tt temp}$

**Step 4:** Predict the response for new observations.

In [53]:
# If we predict at x=0 we will get the intercept
print(lr1.predict(np.array([0]).reshape(-1,1)))

[6.04621296]


In [54]:
# We can predict at multiple points.
# We should see a bump of 9.17 for every one-unit increase in X

print(lr1.predict(np.array([0,1,2]).reshape(-1,1)))

[ 6.04621296 15.21675344 24.38729392]


In [55]:
# Most often we want to predict all values in a matrix
print(lr1.predict(X1))

[ 96.2843313   88.7644881   88.7644881  ... 133.88354727 133.88354727
 126.36370408]


Interpreting the intercept ($\beta_0$):

- It is the value of $y$ when all independent variables are 0.
- Here, it is the estimated number of rentals when the temperature is 0 degrees Celsius.
- <span style="color:blue">**Note:** It does not always make sense to interpret the intercept. (Why?)</span>

Interpreting the "temp" coefficient ($\beta_1$):

- **Interpretation:** An increase of 1 degree Celsius is _associated with_ increasing the number of total rentals by $\beta_1$.
- Here, a temperature increase of 1 degree Celsius is _associated with_ a rental increase of 9.17 bikes.
- This is not a statement of causation.
- $\beta_1$ would be **negative** if an increase in temperature was associated with a **decrease** in total rentals.
- $\beta_1$ would be **zero** if temperature is not associated with total rentals.

#### More Features!

In [30]:
bikes.columns

Index(['rental_hour', 'season', 'workingday', 'weather', 'temp', 'humidity',
       'windspeed', 'total_rentals', 'sunspots'],
      dtype='object')

In [65]:
# These are all the numeric features -- regression needs numbers!
feature_cols2 = ['workingday', 'weather', 'temp', 'humidity', 'windspeed', 'sunspots']
X2 = bikes[feature_cols2]
y2 = bikes.total_rentals

In [58]:
lr2 = LinearRegression()
lr2.fit(X2, y2)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [59]:
print(lr2.intercept_)
print(lr2.coef_)

180.96434370104075
[-1.35048855  3.09471515  8.7489077  -2.75668603  0.32618071 -3.43731298]


In [46]:
X2.shape

(10886, 6)

In [61]:
from sklearn.metrics import mean_squared_error
print(mean_squared_error(y2, lr2.predict(X2)))
print(mean_squared_error(y1, lr1.predict(X1)))

24881.290399662896
27705.2238053288


####  A Quick Diversion:  MSE vs R-Squared

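We are minimizing MSE, but MSE is measured in squared units of $y$, so its size depends on the scale of the response (for example, counting rentals in dozens instead of single bikes would shrink MSE by a factor of 144).

Rescaling an $X$ variable, on the other hand, changes that variable's coefficient but leaves the least squares predictions (and so the MSE) unchanged:
  * Temperature in F instead of C
  * Multiplying 'sunspot activity' by 10
  
Neither kind of rescaling affects the quality of the model.

To compare fits on a unit-free scale, use $R^2$ (R-squared), which normalizes by the variance of $y$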
  * $R^2$ guaranteed to be $\leq$ 1
  * Values close to 1 are better
  * Value of 0 means model is no better than ignoring $X$ completely and predicting a $y$ value equal to the mean
  * Negative value means the model is *worse* than predicting at the mean
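A quick check of how the two metrics respond to a change of units, using synthetic data (the numbers and the helper `fit_scores` are made up for this sketch). It computes $R^2$ by hand as $1 - SS_{res}/SS_{tot}$, then refits after rescaling the response and after converting the single feature from Celsius to Fahrenheit.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: one made-up "temperature" feature and a noisy response
rng = np.random.default_rng(1)
X = rng.uniform(0, 30, size=(500, 1))
y = 10 + 4 * X[:, 0] + rng.normal(0, 20, size=500)

def fit_scores(X, y):
    """Fit ordinary least squares and return (MSE, R^2) on the same data."""
    pred = LinearRegression().fit(X, y).predict(X)
    return mean_squared_error(y, pred), r2_score(y, pred)

mse, r2 = fit_scores(X, y)

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares)
pred = LinearRegression().fit(X, y).predict(X)
r2_by_hand = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

# Multiply the response by 10: MSE grows by 100, R^2 stays put
mse_y10, r2_y10 = fit_scores(X, 10 * y)

# Convert the feature from Celsius to Fahrenheit: neither metric changes
mse_f, r2_f = fit_scores(X * 9 / 5 + 32, y)

print(mse, mse_y10, mse_f)
print(r2, r2_y10, r2_f, r2_by_hand)
```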

In [62]:
from sklearn.metrics import r2_score
print(r2_score(y2, lr2.predict(X2)))
print(r2_score(y1, lr1.predict(X1)))

0.2416621840009524
0.15559367802794855


In [68]:
list(zip(feature_cols2, lr2.coef_))

[('workingday', -1.3504885452995659),
 ('weather', 3.0947151494371967),
 ('temp', 8.748907695621075),
 ('humidity', -2.756686028546445),
 ('windspeed', 0.3261807055805291),
 ('sunspots', -3.4373129750294273)]

#### Standard Regression Output

In [69]:
import statsmodels.api as sm
X2C = sm.add_constant(X2)
results = sm.OLS(y2, X2C).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          total_rentals   R-squared:                       0.242
Model:                            OLS   Adj. R-squared:                  0.241
Method:                 Least Squares   F-statistic:                     577.8
Date:                Wed, 20 Nov 2019   Prob (F-statistic):               0.00
Time:                        14:06:42   Log-Likelihood:                -70540.
No. Observations:               10886   AIC:                         1.411e+05
Df Residuals:                   10879   BIC:                         1.411e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        180.9643      8.471     21.363      0.0
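The "Adj. R-squared" entry in the summary penalizes $R^2$ for the number of predictors. As a small sketch, the standard formula $1 - (1 - R^2)\frac{n-1}{n-k-1}$ can be applied to the numbers already printed: the $R^2$ value from `r2_score` earlier, the number of observations, and "Df Model".

```python
# Values copied from the outputs shown earlier in this notebook
r2 = 0.2416621840009524   # R-squared for the six-feature model
n = 10886                 # No. Observations
k = 6                     # Df Model (number of predictors)

# Adjusted R-squared shrinks R-squared slightly for each added predictor
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 3))
```

With this many observations and only six predictors the adjustment is tiny, which is why the two values in the summary differ only in the third decimal place.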

###  Revisit Our Attributes

* 'rental_hour' a timestamp  -- OUT
* 'season'  a string (four values)  -- OUT
* 'workingday' a 0-1 boolean -- IN
* 'weather' ranges from 1 to 4 -- IN
* 'temp' real-valued -- IN
* 'humidity' real-valued -- IN
* 'windspeed' real-valued -- IN
* 'sunspots'  0 to 2 from "none" to "high" -- IN

Remember 
  * Regression assigns a coefficient to each $x_i$
  * The coefficient $\beta_i$ means "if I increase (decrease) $x_i$ by 1 unit, I expect $y$ to increase (decrease) by $\beta_i$ units, with the other variables held fixed"
  
For which of these variables does this make sense?

### Numeric vs Categorical Variables

*  Numeric $x$
  *  It makes sense to think about $x+1$ or even $x \div 2$
  *  It must always be the case that a given increase in $x$ has the same effect on $y$
  
* Categorical $x$
  *  Selected from a (usually) small set of values
    *  \{ red, white, blue \}
    * \{ true, false \} 
    
***NOTICE:  Just because your data frame has something coded as an integer doesn't make it numeric!!***

#####  Is our Y variable categorical or is it numeric?  Must it be one or the other always??

Which of our X variables are numeric, and which are categorical?


#### Dealing With the Categorical Variables

* season -- string (four values)
* workingday -- 0/1
* weather (1-4)
* sunspots (0-2)

In [70]:
bikes['season'].value_counts()

fall      2734
summer    2733
spring    2733
winter    2686
Name: season, dtype: int64

In [71]:
bikes['season'] = bikes.season.replace({'fall': 0, 'summer': 1, 'spring': 2, 'winter': 3})

In [73]:
bikes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 9 columns):
rental_hour      10886 non-null datetime64[ns]
season           10886 non-null int64
workingday       10886 non-null int64
weather          10886 non-null int64
temp             10886 non-null float64
humidity         10886 non-null int64
windspeed        10886 non-null float64
total_rentals    10886 non-null int64
sunspots         10886 non-null int64
dtypes: datetime64[ns](1), float64(2), int64(6)
memory usage: 765.5 KB


In [76]:
season_dummies = pd.get_dummies(bikes.season, prefix='season')

In [77]:
season_dummies.sample(n=5)

| | season_0 | season_1 | season_2 | season_3 |
|---|---|---|---|---|
| 6229 | 0 | 0 | 0 | 1 |
| 1413 | 0 | 0 | 1 | 0 |
| 3235 | 0 | 1 | 0 | 0 |
| 5461 | 0 | 0 | 0 | 1 |
| 9334 | 0 | 1 | 0 | 0 |


In [78]:
season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)
bikes = pd.concat([bikes, season_dummies], axis=1)
bikes.drop(['season'], axis=1, inplace=True)

In [79]:
bikes.head(5)

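| | rental_hour | workingday | weather | temp | humidity | windspeed | total_rentals | sunspots | season_1 | season_2 | season_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 0 | 1 | 9.84 | 81 | 0.0 | 16 | 0 | 0 | 0 | 1 |
| 1 | 2011-01-01 01:00:00 | 0 | 1 | 9.02 | 80 | 0.0 | 40 | 1 | 0 | 0 | 1 |
| 2 | 2011-01-01 02:00:00 | 0 | 1 | 9.02 | 80 | 0.0 | 32 | 1 | 0 | 0 | 1 |
| 3 | 2011-01-01 03:00:00 | 0 | 1 | 9.84 | 75 | 0.0 | 13 | 0 | 0 | 0 | 1 |
| 4 | 2011-01-01 04:00:00 | 0 | 1 | 9.84 | 75 | 0.0 | 1 | 1 | 0 | 0 | 1 |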
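One dummy column is dropped because, with an intercept in the model, a full set of dummies is perfectly collinear (the columns always sum to 1); the dropped category becomes the baseline that the other coefficients are measured against. A possible shortcut, sketched here on a made-up season column rather than the real frame, is the `drop_first` argument of `pd.get_dummies`.

```python
import pandas as pd

# Made-up season column standing in for the real one
seasons = pd.Series(['winter', 'spring', 'summer', 'fall', 'winter'],
                    name='season')

# drop_first=True drops the first category in sorted order ('fall' here),
# which then acts as the baseline level
dummies = pd.get_dummies(seasons, prefix='season', drop_first=True, dtype=int)
print(dummies)
```

A row of all zeros in the result means the observation belongs to the baseline category.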


####  What about weather (it's kind of categorical)
####  What about sunspots (it's kind of categorical)

#### Dealing with rental_hour

* Rental hour is the hour at which the $y$ variable was measured
* It is strictly increasing over time
  * Is that like an index or row number?   And if so do we keep it?
  * Another question:  is there simply a time effect (business is expanding over time, independent of other factors)
* There certainly should be time effects (i.e. attributes that depend only on time / rental_hour)
  * workingday is a time effect
  * others?
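Candidate time effects can be pulled out of a timestamp column with the pandas `.dt` accessor. The sketch below uses three made-up timestamps; which of these features (if any) belong in the model is left open.

```python
import pandas as pd

# Three made-up timestamps standing in for the rental_hour column
stamps = pd.Series(pd.to_datetime([
    '2011-01-01 00:00:00',
    '2011-01-03 08:00:00',
    '2011-07-15 17:00:00',
]))

# Each attribute depends only on the timestamp itself
features = pd.DataFrame({
    'hour': stamps.dt.hour,
    'dayofweek': stamps.dt.dayofweek,   # Monday=0 ... Sunday=6
    'month': stamps.dt.month,
})
print(features)
```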

In [81]:
bikes.rental_hour[0]

Timestamp('2011-01-01 00:00:00')

In [86]:
bikes.rental_hour.map(lambda t: t.hour).min()

0

In [90]:
# 1 for hours 9 through 17 (i.e. hour > 8 and hour < 18), 0 otherwise
daytime = bikes.rental_hour.dt.hour.between(9, 17).astype(int)

In [92]:
bikes['daytime'] = daytime

In [93]:
bikes['hours_since_open'] = range(0, bikes.shape[0])

In [94]:
bikes.head(5)

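| | rental_hour | workingday | weather | temp | humidity | windspeed | total_rentals | sunspots | season_1 | season_2 | season_3 | daytime | hours_since_open |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 0 | 1 | 9.84 | 81 | 0.0 | 16 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 2011-01-01 01:00:00 | 0 | 1 | 9.02 | 80 | 0.0 | 40 | 1 | 0 | 0 | 1 | 0 | 1 |
| 2 | 2011-01-01 02:00:00 | 0 | 1 | 9.02 | 80 | 0.0 | 32 | 1 | 0 | 0 | 1 | 0 | 2 |
| 3 | 2011-01-01 03:00:00 | 0 | 1 | 9.84 | 75 | 0.0 | 13 | 0 | 0 | 0 | 1 | 0 | 3 |
| 4 | 2011-01-01 04:00:00 | 0 | 1 | 9.84 | 75 | 0.0 | 1 | 1 | 0 | 0 | 1 | 0 | 4 |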


In [96]:
feature_cols3 = ['workingday', 'weather', 'temp', 'humidity', 'windspeed', 'sunspots', 'season_1', 'season_2', 'season_3', 'daytime', 'hours_since_open']
X3 = bikes[feature_cols3]
y3 = bikes.total_rentals

In [97]:
X3C = sm.add_constant(X3)
results = sm.OLS(y3, X3C).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          total_rentals   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.347
Method:                 Least Squares   F-statistic:                     526.9
Date:                Wed, 20 Nov 2019   Prob (F-statistic):               0.00
Time:                        14:44:05   Log-Likelihood:                -69720.
No. Observations:               10886   AIC:                         1.395e+05
Df Residuals:                   10874   BIC:                         1.396e+05
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               62.5476     10.227  

In [98]:
feature_cols4 = ['temp', 'humidity', 'season_1', 'season_2', 'daytime', 'hours_since_open']
X4 = bikes[feature_cols4]
y4 = bikes.total_rentals

In [99]:
X4C = sm.add_constant(X4)
results = sm.OLS(y4, X4C).fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:          total_rentals   R-squared:                       0.347
Model:                            OLS   Adj. R-squared:                  0.346
Method:                 Least Squares   F-statistic:                     961.6
Date:                Wed, 20 Nov 2019   Prob (F-statistic):               0.00
Time:                        14:47:06   Log-Likelihood:                -69730.
No. Observations:               10886   AIC:                         1.395e+05
Df Residuals:                   10879   BIC:                         1.395e+05
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const               52.6050      7.258  

### Evaluating Models / Comparing Models
---

We can make linear models now, but how do we select the "best" model to use for our applications? 

These are three different questions:
1.  How well does the model fit the observed data (training set)?
2.  How well will the model predict future observations?
3.  What is the best model for the business scenario (i.e. taking into account the reward for predicting correctly and the penalty for making an error)?

Accuracy on the training set may differ from accuracy when the model goes into production for two reasons:
1.  The training set was too small -- important features in the population didn't show up enough to be noticed
2.  The model found spurious signal in the training set, such as "total sales is higher on days that begin with 'T'" (overfitting)
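One common way to see the gap between the first two questions is to hold out part of the data and score the model on rows it never saw. The sketch below is an illustration on synthetic data, not the bikeshare set: a single real signal is mixed with 25 made-up noise features, and the model is trained on only half of the 80 rows, so it has plenty of room to latch onto spurious patterns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 80

# One feature that really drives y, plus 25 features unrelated to y
signal = rng.normal(size=(n, 1))
noise_features = rng.normal(size=(n, 25))
X = np.hstack([signal, noise_features])
y = 2.0 * signal[:, 0] + rng.normal(size=n)

# Hold out half of the rows to stand in for "future observations"
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))

# Training error looks optimistic compared with error on held-out rows
print(train_mse, test_mse)
```

The training-set error flatters the model because the noise features were fitted to that particular sample; the held-out error is the more honest estimate of how it would do in production.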
