# Linear Regression Part 2


## Topics:
* Recap
* Statsmodels
* Adjusted R^2
* Categorical Variables
    * Why k-1 variables?
* Interaction Terms
* Polynomial Variables
* What about many levels per categorical variable?


## Recap

Recall that linear regression models a response variable $Y$ as a linear combination of variables $X$, weighted by their associated coefficients, $\beta$.

While we know the data points for $Y$ and $X$, we must estimate $\beta$. The OLS (Ordinary Least Squares) estimate for $\beta$ is:


$\hat{\beta} = {(X'X)}^{-1}X'Y$

Once we have estimates for $\beta$, we can perform:
* Prediction (Predictions are on data points encoded as $X$)
* Weight Analysis (How much does each factor contribute?)
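
As a minimal sketch of the estimate above, the normal equations can be solved directly with NumPy. The data here is synthetic and the true coefficients (2, 3, -1) are assumptions chosen for illustration; this is not the salary data used later in the notebook.

```python
import numpy as np

# Synthetic data (an assumption for illustration):
# y = 2 + 3*x1 - 1*x2 + small noise
rng = np.random.RandomState(0)
n = 200
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)
y = 2.0 + 3.0 * x1 - 1.0 * x2 + rng.normal(0, 0.1, n)

# Design matrix with an explicit column of ones for the intercept
X = np.column_stack([np.ones(n), x1, x2])

# Normal equations: solve (X'X) beta = X'y instead of forming the inverse
beta_hat = np.linalg.solve(X.T.dot(X), X.T.dot(y))

# Prediction on the same data points
y_hat = X.dot(beta_hat)
print(beta_hat)
```

Solving the linear system is numerically safer than computing ${(X'X)}^{-1}$ explicitly, but it gives the same estimate whenever $X'X$ is invertible.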


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

from sklearn import datasets
from sklearn import linear_model
from patsy import dmatrices

%matplotlib inline

In [None]:
data = pd.read_csv("http://data.princeton.edu/wws509/datasets/salary.dat", sep=r'\s+')
data.head()

In [None]:
# First, let's try using the sklearn linear model
X = data[ ['yr','yd'] ]
y = data[ 'sl' ]

In [None]:
# Create model object. We'll explicitly include an intercept term
model = linear_model.LinearRegression(fit_intercept=True) 

In [None]:
# Fit the beta coefficients using X and y
model.fit(X,y)

In [None]:
# Prediction
y_pred = model.predict(X)

In [None]:
# Coefficients
print("Intercept: {}  Coefficients: {}".format(model.intercept_, model.coef_))

In [None]:
# Get R^2 score once you have fitted the model
model.score(X,y)

In [None]:
# Can also calculate the R^2 manually
# R^2 = 1 - SSE/SST
#   where SSE is the sum of squared errors (residuals)
#   and SST is the sum of squared deviations of y from its mean
sse = np.power(y_pred - y, 2).sum()
sst = np.power(y - y.mean(), 2).sum()
score = 1.0 - (sse / sst)
print(score)

In [None]:
# Model form
print("sl ~ {} + {} * yr + {} * yd".format(model.intercept_, model.coef_[0], model.coef_[1]))

In [None]:
# Single train/test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8)

model = linear_model.LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)
print("Train Score: {}".format(model.score(X_train, y_train)))
print("Test Score: {}".format(model.score(X_test, y_test)))

In [None]:
# A single split gives a high-variance estimate of the test score; let's try cross validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
model = linear_model.LinearRegression(fit_intercept=True)

# Instead of cv=5 or cv=10, pass an explicit KFold splitter.
# With an integer cv, cross_val_score does not shuffle the rows; KFold(shuffle=True) does.
cv_scores = cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True))

print("Avg CV Score: {}".format(cv_scores.mean()))

## Statsmodels

- Statsmodels is a Python package that provides convenient utilities for fitting statistical models and investigating their results.
- It uses `patsy` to provide R formula syntax

A formula allows you to write a functional relationship between variables.  
Example:

```R
Y ~ X1 + X2 + X3
```

A formula automatically includes an intercept term. You can make this explicit by writing:

```R
Y ~ 1 + X1 + X2 + X3
```

http://statsmodels.sourceforge.net

In [None]:
import statsmodels.formula.api as sm

In [None]:
# Formula as above. Fit salary on an intercept, yr, and yd
model = sm.ols(formula="sl ~ yr + yd", data=data).fit() # Call fit on the ols model

In [None]:
model.summary()

# Categorical Variables

Using statsmodels, we can encode the 'formula' directly into the model, instead of manually creating design matrices through patsy. In fact, statsmodels uses patsy under the hood to make the process easy.

* [Patsy](https://patsy.readthedocs.org/en/latest/)
* [Statsmodels](http://statsmodels.sourceforge.net)

In [None]:
# Can add categorical variables easily!
model = sm.ols(formula="sl ~ yr + yd + rk", data=data).fit() # Call fit on the ols model
model.summary()


-----
**Question: Why doesn't patsy/statsmodels include rk[T.assistant] as well?**

If we add **rk[T.assistant]** to the model, the design matrix becomes rank deficient, so $X'X$ is singular and ${(X'X)}^{-1}$ does not exist. This is called the [Dummy Variable Trap](http://www.algosome.com/articles/dummy-variable-trap-regression.html).

In short, if you did add rk[T.assistant] as a column of your matrix, then rk[T.full] + rk[T.associate] + rk[T.assistant] = 1 in every row. The sum of the three dummy columns is exactly the intercept column!
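
The rank deficiency can be checked numerically. The sketch below builds the dummy columns by hand with NumPy for six hypothetical people (the ranks are made up for illustration, not taken from the salary data) and compares the matrix rank with and without the extra dummy.

```python
import numpy as np

# Hypothetical ranks for six people (illustrative, not the salary data)
ranks = np.array(["assistant", "associate", "full", "full", "assistant", "associate"])
intercept = np.ones(len(ranks))

# One dummy column per level
is_assistant = (ranks == "assistant").astype(float)
is_associate = (ranks == "associate").astype(float)
is_full = (ranks == "full").astype(float)

# All k = 3 dummies plus the intercept: the dummies sum to the intercept column
X_all = np.column_stack([intercept, is_assistant, is_associate, is_full])
# k - 1 = 2 dummies plus the intercept: assistant becomes the baseline
X_drop = np.column_stack([intercept, is_associate, is_full])

rank_all = np.linalg.matrix_rank(X_all)    # fewer than the 4 columns
rank_drop = np.linalg.matrix_rank(X_drop)  # equal to the 3 columns
print(rank_all, X_all.shape[1])
print(rank_drop, X_drop.shape[1])
```

When the rank is smaller than the number of columns, $X'X$ has no inverse, which is why one level is dropped.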

-----
**Question: Then, how do I interpret the above coefficients?**

You can look at it this way. When rk[T.associate]==0 and rk[T.full]==0, the rank must be assistant, so the regression's salary for an assistant gets no rank adjustment at all. Where does the assistant effect live, then? In the **intercept**!

```
sl ~ Intercept + Rk[Associate] + Rk[Full] + Yr + Yd
```

If either rk[T.associate] == 1 OR rk[T.full] == 1, you can read the respective beta coefficient as the marginal contribution to salary relative to an assistant. The intercept encodes the case when rank == Assistant.
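
A small numeric sketch of this interpretation, using made-up salaries and a model with rank as the only factor (both are assumptions for illustration). With an intercept and k-1 dummies, the intercept recovers the baseline group's mean and each dummy coefficient recovers the shift away from that baseline.

```python
import numpy as np

# Hypothetical salaries by rank (illustrative numbers, not the salary data)
ranks = np.array(["assistant", "assistant", "associate", "associate", "full", "full"])
salary = np.array([50.0, 54.0, 60.0, 66.0, 80.0, 90.0])

# Intercept plus k-1 dummies; assistant is the omitted baseline level
X = np.column_stack([
    np.ones(len(ranks)),
    (ranks == "associate").astype(float),
    (ranks == "full").astype(float),
])
beta, _, _, _ = np.linalg.lstsq(X, salary, rcond=None)

assistant_mean = salary[ranks == "assistant"].mean()
associate_mean = salary[ranks == "associate"].mean()
full_mean = salary[ranks == "full"].mean()

# Intercept = baseline mean; each dummy coefficient = shift away from the baseline
print(beta)
print(assistant_mean, associate_mean - assistant_mean, full_mean - assistant_mean)
```

Once other factors such as yr and yd are in the model, the same reading holds with "holding the other factors fixed" added.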

-----

## Additional Factors

```
sm.ols(formula="sl ~ yr + yd + rk", data=data)
```

As you can see in the above examples, `+` is not acting as an addition operator but as a separator between variables.

There are other operators that lose their algebraic meaning in a formula. 

- `+` adds a new variable.
- `:` adds the _interaction_ of two variables. 
- `*` adds the original terms as well as their interaction effect.
- `C(yr)` adds a set of terms where yr is converted to dummy variables
- `np.power(x,2)` adds a term where x is raised to the second power

[Formula types](https://patsy.readthedocs.org/en/latest/formulas.html#the-formula-language)
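
To make the operators above concrete, here is a hand-built sketch of the columns they stand for, using NumPy only. The values and the `is_full` dummy are hypothetical; in practice patsy builds these columns from the formula for you.

```python
import numpy as np

# Hypothetical values (illustrative only)
yr = np.array([1.0, 3.0, 5.0, 7.0])
is_full = np.array([0.0, 1.0, 0.0, 1.0])  # dummy for one rank level

# `yr:is_full` -> a single column, the elementwise product
interaction = yr * is_full

# `yr*is_full` -> intercept + both original columns + their interaction
X = np.column_stack([np.ones(len(yr)), yr, is_full, interaction])

# np.power(yr, 2) -> a column with yr squared
yr_squared = np.power(yr, 2)
print(X)
print(yr_squared)
```

The interaction column is zero wherever the dummy is zero, so its coefficient only changes the slope on `yr` for rows where the dummy is one.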

In [None]:
# Let's try one of these interaction terms
# yr*rk expands to yr + rk + yr:rk (main effects plus one interaction column per non-baseline rank)
model = sm.ols(formula="sl ~ yr*rk", data=data).fit()
model.summary() 

## Analysis

How do I know whether the current model I have is bad or not?

0. $R^2$/ Adjusted $R^2$
1. Correlation matrix
2. T Stat/P Values
3. F statistic
4. Skew/Kurtosis/JB Test
5. In-sample/out-of-sample predictions


- Adjusted $R^2$ takes into account the number of factors in the model and penalizes each additional one (i.e. you can't just keep adding factors until you get a perfect fit)
- The correlation matrix of the factors helps you find collinear factors a priori
- T-stats/p-values of the beta coefficients help diagnose which factors are not contributing
- The F statistic tests whether the factors jointly explain more than an intercept-only model
- Skew/Kurtosis/JB Test are somewhat second-order stats and help diagnose whether the errors are truly normally distributed
- In-sample/out-of-sample (OOS) predictions are usually the most direct check: a model that scores well in-sample but poorly out-of-sample is overfit
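
The usual definition of adjusted $R^2$ is $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of factors excluding the intercept. The sketch below applies it to synthetic data (an assumption for illustration) and adds five pure-noise columns to show the penalty at work; the helper names are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an OLS fit; X must already contain the intercept column."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X.dot(beta)) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - sse / sst

def adjusted_r_squared(r2, n, p):
    """n = number of observations, p = number of factors excluding the intercept."""
    return 1.0 - (1.0 - r2) * (n - 1.0) / (n - p - 1.0)

rng = np.random.RandomState(0)
n = 30
x = rng.uniform(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)

X_small = np.column_stack([np.ones(n), x])
# Add five pure-noise columns that have nothing to do with y
X_big = np.column_stack([X_small, rng.normal(size=(n, 5))])

r2_small, r2_big = r_squared(X_small, y), r_squared(X_big, y)
adj_small = adjusted_r_squared(r2_small, n, 1)
adj_big = adjusted_r_squared(r2_big, n, 6)
print(r2_small, r2_big)
print(adj_small, adj_big)
```

Plain $R^2$ can never go down when columns are added, even useless ones, whereas adjusted $R^2$ can, which is what makes it the fairer number for comparing models of different sizes.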

In [None]:
# Let's go over the analysis terms
model = sm.ols(formula="sl ~ yr*rk", data=data).fit()
model.summary()

In [None]:
# Test 1 -- Correlations of Factors, going to use Patsy here for automatic column names
y,X = dmatrices("sl ~ yr*rk",data=data,return_type="dataframe")
X.corr()

The **yr:rk[T.full]** factor is pretty correlated with **rk[T.full]** as well as with **yr**.

This makes sense, as the interaction variable is just yr * "Is rank full".

**Question: What can happen when there are collinear (similar) variables in a model?**

**Answer: It depends**

* If the level of collinearity is low, the predictions may not change much, but the individual beta values become hard to interpret, so it is hard to get an intuition from them
* If the degree of collinearity is high, however, the predictions may change drastically based on small changes in the input. Chalkboard example to follow.
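
One way to see the instability numerically is to fit the same relationship twice with two nearly identical factors and slightly different noise. Everything in this sketch is synthetic and chosen for illustration: `x2` is `x1` plus a tiny perturbation, and the true coefficients are both 1.

```python
import numpy as np

rng = np.random.RandomState(0)
n = 50
x1 = rng.uniform(0, 1, n)
x2 = x1 + rng.normal(0, 1e-3, n)   # almost an exact copy of x1
X = np.column_stack([np.ones(n), x1, x2])

def fit(y):
    """OLS coefficients for the fixed design matrix X."""
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Same underlying relationship, two different small noise draws
signal = 1.0 + x1 + x2
beta_a = fit(signal + rng.normal(0, 0.1, n))
beta_b = fit(signal + rng.normal(0, 0.1, n))

print(np.linalg.cond(X))    # a large condition number signals collinearity
print(beta_a)
print(beta_b)
print(beta_a[1] + beta_a[2], beta_b[1] + beta_b[2])  # the sum is far more stable
```

The data can pin down the combined effect of `x1` and `x2`, but not how to split it between them, so the individual coefficients are free to swing between fits.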


------
# Polynomial Terms

We can also add polynomial terms (i.e. $X^2$, $X^3$, etc.) as factors. If there is some nonlinearity that we want to capture while still using a linear model, we can add polynomial terms: the model stays linear in the coefficients $\beta$ even though it is no longer linear in $X$.

Let's take a look at how much the model predictions change based on how many polynomial terms we add.

In [None]:
# Create the data points we're going to use first
data = pd.DataFrame({'x': np.linspace(0,1,8),
                     'y': [1.0,1.4,1.6,1.4,1.7,1.45,1.7,2.0] })
plt.scatter(data.x,data.y)

In [None]:
# Fit a model of y ~ Intercept + x
model = sm.ols("y ~ x", data=data).fit()
model.summary()
# Predict on a fine grid; the formula is re-applied to the new x values
x_new = np.linspace(0, 1, 100)
y_new = model.predict(pd.DataFrame({'x': x_new}))
plt.scatter(data.x, data.y)
plt.plot(x_new, y_new)

In [None]:
# Fit a model of y ~ Intercept + x + x^2
model = sm.ols("y ~ x + np.power(x,2)", data=data).fit()
# Predict on a fine grid; the formula is re-applied to the new x values
x_new = np.linspace(0, 1, 100)
y_new = model.predict(pd.DataFrame({'x': x_new}))
plt.scatter(data.x, data.y)
plt.plot(x_new, y_new)

In [None]:
# Fit a model of y ~ Intercept + x + x^2 + x^3
model = sm.ols("y ~ x + np.power(x,2) + np.power(x,3)", data=data).fit()
# Predict on a fine grid; the formula is re-applied to the new x values
x_new = np.linspace(0, 1, 100)
y_new = model.predict(pd.DataFrame({'x': x_new}))

plt.scatter(data.x, data.y)
plt.plot(x_new, y_new)

In [None]:
# Fit a model of y ~ Intercept + x + x^2 + ... + x^9
# (10 coefficients for only 8 data points)
model = sm.ols("y ~ x + np.power(x,2) + np.power(x,3) + np.power(x,4) + np.power(x,5) + np.power(x,6) + np.power(x,7) + np.power(x,8) + np.power(x,9)", data=data).fit()
# Predict on a fine grid; the formula is re-applied to the new x values
x_new = np.linspace(0, 1, 100)
y_new = model.predict(pd.DataFrame({'x': x_new}))

plt.scatter(data.x, data.y)
plt.plot(x_new, y_new)

The 9-term polynomial fits our data points perfectly! This is bad: we have matched the given data points exactly, but as soon as a new data point comes in, the prediction can be woefully inaccurate.

* For instance--what about a prediction at x = .9?
* Also, what about the correlation of the data for X, X^2, X^3, ..., X^9?
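
A rough way to explore the first question without statsmodels is `np.polyfit`. Note the swap: this sketch uses a degree-7 polynomial, which is the smallest degree that can pass through all 8 points exactly, rather than the 9-term statsmodels model fit earlier, so the numbers will not match that model.

```python
import numpy as np

x = np.linspace(0, 1, 8)
y = np.array([1.0, 1.4, 1.6, 1.4, 1.7, 1.45, 1.7, 2.0])

# Degree 7 has 8 coefficients, enough to pass through all 8 points exactly
coef_line = np.polyfit(x, y, 1)
coef_poly = np.polyfit(x, y, 7)

# In-sample error: the high-degree fit reproduces the training points
max_err_line = np.abs(np.polyval(coef_line, x) - y).max()
max_err_poly = np.abs(np.polyval(coef_poly, x) - y).max()
print(max_err_line, max_err_poly)

# Predictions at x = 0.9, which lies between the last two training points
print(np.polyval(coef_line, 0.9), np.polyval(coef_poly, 0.9))
```

A near-zero in-sample error says nothing about what happens between the training points; comparing the two predictions at x = 0.9 against the trend in the scatter plot shows how far the two fits can diverge there.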

In [None]:
# All the data is extremely correlated
y, X = dmatrices("y ~ x + np.power(x,2) + np.power(x,3) + np.power(x,4) + np.power(x,5) + np.power(x,6) + np.power(x,7) + np.power(x,8) + np.power(x,9)",data=data,return_type="dataframe")
X.corr()

Oh no! The factors are extremely correlated!!

This means the beta values/coefficients could change a lot depending on the data that we pass in to train the model.

End result? The beta coefficients could have a high variance -> the predictions could have a high variance

# Summary

So far, we've observed a number of cases where we can run into issues with linear regression, all related to overfitting.

* We want to use too many factors for the amount of data we have
* Some of the categories have many levels. An example would be a categorical variable for "US State", which has 50 levels, one per state. We still want to encode the effects of the states, but there's not enough data per state.
* We want to use similar factors because we believe they have some predictive power, but are afraid that they are multicollinear
* We want to use polynomial terms, but are afraid that they will lead to overfitting


Let's explore regularization to solve many of these issues.