# Regression Analysis

Thus far we've seen descriptive statistics, filtering, and confirmatory analysis. All of that comes together in modeling.

In a way we've already been doing statistical modeling, because statistical modeling is merely the sum of the parts we've covered thus far.

Generally speaking statistics, mathematical measures which describe data, can be broken out into a few categories:

* L-estimators
* M-estimators
* "Advanced" estimators

So far we've been looking at L-estimators: things like the mean, the median, the standard deviation, or the interquartile range.

Each of these measures has a clear and _simple_ mathematical description with a clear and simple intuition for humans.  As an aside, because these estimators are so simple, people often misinterpret their results, either by misunderstanding the underlying data or by not ensuring that all assumptions of the L-estimator are satisfied.  This is the so-called "lying with statistics".
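As a small illustration of how a simple estimator can mislead, the sketch below compares the mean, median, and interquartile range on a short list of fares that contains one extreme value. The numbers are invented for illustration and are not taken from the taxi data.

```python
import numpy as np

# Hypothetical fares in dollars; the last value is a deliberate outlier.
fares = np.array([8.0, 9.5, 10.0, 11.0, 12.5, 250.0])

mean_fare = fares.mean()
median_fare = np.median(fares)

# Interquartile range: the spread of the middle 50% of the data.
q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1

print(f"mean   = {mean_fare:.2f}")
print(f"median = {median_fare:.2f}")
print(f"IQR    = {iqr:.2f}")
```

The single outlier pulls the mean far above the median, so reporting only the mean here would give a misleading picture of a typical fare.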

There is nothing new with M-estimators like the one we'll look at here, or the ones we'll look at in the next exercise, except for complexity.  

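Our first example of an M-estimator is called Ordinary Least Squares.  It is given its name because of how the estimator works and how it's used: it chooses the coefficients that minimize the sum of the squared differences between the observed values and the model's predictions.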

With the L-estimators we looked at previously we needed to look at only a single column.  This is in part because we were learning simple patterns, things like the center or spread of a single column.

Now, with M-estimators we'll be looking at things like the strength of the relationship between two or more related variables.  Also, the description our M-estimator produces won't be a single number; instead it will be an equation, which is an approximation of the relationships in the underlying data.

If this equation is a reasonable approximation, then a whole set of truths falls out of it, and we can leverage all of the relevant pieces of mathematics to inform our analysis.

For instance, with an equation like:

`Fair_amount = 2*Trip_distance + epsilon`

Here epsilon is some small amount of noise which is normally distributed with mean 0 and standard deviation 1.

If the above equation holds true we can make informed decisions about when to take a cab!  But more than that, we know the derivative of this equation, the visual graph of this function, and many other facts about the relationship between `Fair_amount` and `Trip_distance`.  Of course, 2 is just a made-up coefficient that probably isn't accurate.

Using Ordinary Least Squares, we'll be able to figure out what the real coefficient is.  And we can use that information to inform an analysis of taxi cabs across New York City.
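To see what "figuring out the real coefficient" looks like, here is a minimal sketch that simulates data from the made-up equation above and then recovers the coefficient by least squares. The simulated distances, the seed, and the use of `np.linalg.lstsq` without an intercept are assumptions for illustration; this is not the real taxi data.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical trip distances in miles (simulated, not real data).
trip_distance = rng.uniform(0.5, 20.0, size=1000)
# epsilon: normally distributed noise with mean 0 and standard deviation 1.
epsilon = rng.normal(0.0, 1.0, size=1000)
fair_amount = 2 * trip_distance + epsilon

# Least squares with a single predictor and no intercept:
# find the slope b that minimizes sum((fair_amount - b * trip_distance) ** 2).
A = trip_distance.reshape(-1, 1)
coef, *_ = np.linalg.lstsq(A, fair_amount, rcond=None)
slope = coef[0]

print(f"estimated coefficient: {slope:.3f}")
```

Because the data were generated with a coefficient of 2, the estimate should land close to 2; on real data we would not know the true value in advance.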

## A First Example

Now that we've got a basic understanding of the goals of linear regression, let's see how it works in practice:

In [31]:
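import numpy as np
import statsmodels.api as sm


def generate_data():
    # Draw 100 observations of two independent standard-normal features.
    X = np.random.normal(0, 1, (100, 2))
    # y is an exact, noise-free linear combination of the two features.
    y = X[:, 0] * 2 + X[:, 1] * 3
    return X, y

X, y = generate_data()
X = sm.add_constant(X)  # prepend a column of ones for the intercept
model = sm.OLS(y, X)
results = model.fit()
results.summary()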

| | | | |
|---|---|---|---|
| Dep. Variable: | y | R-squared: | 1.0 |
| Model: | OLS | Adj. R-squared: | 1.0 |
| Method: | Least Squares | F-statistic: | 2.651e+32 |
| Date: | Tue, 24 Jul 2018 | Prob (F-statistic): | 0.0 |
| Time: | 17:25:15 | Log-Likelihood: | 3273.0 |
| No. Observations: | 100 | AIC: | -6540.0 |
| Df Residuals: | 97 | BIC: | -6532.0 |
| Df Model: | 2 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -3.331e-16 | 1.5e-16 | -2.215 | 0.029 | -6.31e-16 | -3.46e-17 |
| x1 | 2.0000 | 1.6e-16 | 1.25e+16 | 0.000 | 2.000 | 2.000 |
| x2 | 3.0000 | 1.53e-16 | 1.96e+16 | 0.000 | 3.000 | 3.000 |

| | | | |
|---|---|---|---|
| Omnibus: | 2.034 | Durbin-Watson: | 2.038 |
| Prob(Omnibus): | 0.362 | Jarque-Bera (JB): | 1.838 |
| Skew: | -0.331 | Prob(JB): | 0.399 |
| Kurtosis: | 2.954 | Cond. No. | 1.11 |


## Interpreting our Model

Now that we've seen our model, let's interpret it:

## High-level point

For linear models the only thing we are measuring is the interplay in variance between the independent and dependent variables.

## Goodness of fit

The first thing we want to look at is the R^2 score.  The R^2 is a measure of "goodness of fit" of our model.  If our model fits our data well, then we have a reasonable level of understanding of how our dependent variable (y) is affected by our independent variables (X).

It's worth noting that R^2 is only an effective measure of fit for linear models.  It is not good for capturing non-linear fit, because of an underlying assumption of R^2: it assumes a linear regression was performed, and then compares the sum of squared residuals left over by the model with the total sum of squares of the dependent variable.

For interpretability purposes, we can think of R^2 as the proportion of the variance in y that is explained by X.

So for the above example, X explains 100% of the variance in y.  And thereby y is fully explained by X, in a statistical correlation sense.  This is expected here, because y was generated from X with no noise.
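To make the R^2 calculation concrete, here is a minimal sketch that computes it directly with NumPy as one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared`, the seed, and the noisy variant of y are assumptions added for illustration.

```python
import numpy as np

def r_squared(X, y):
    # Fit OLS with an intercept, then compare residual and total sums of squares.
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))
y = X[:, 0] * 2 + X[:, 1] * 3  # noise-free, like the first example

r2_exact = r_squared(X, y)
# Adding noise to y leaves some variance unexplained, so R^2 drops below 1.
r2_noisy = r_squared(X, y + rng.normal(0, 2, 100))

print(f"R^2 without noise: {r2_exact:.4f}")
print(f"R^2 with noise:    {r2_noisy:.4f}")
```

The noise-free case reproduces the R^2 of 1.0 reported in the summary, while the noisy case shows what a less-than-perfect fit looks like.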


## Are any of the variables important?

The next measure usually under consideration is the F-test, which asks: are all the variables, taken together, statistically significant for the dependent variable (y)?

The null hypothesis of the F-test is that all of the coefficients (other than the constant) are zero.  If the p-value is above 0.05 we fail to reject the null hypothesis, meaning the variables, taken together, are probably not good predictors of y, and we should choose new variables as predictors.  If the p-value is below 0.05 then we can likely safely say that at least some of the independent variables (X) are statistically related to the dependent variable (y).

Here we get a p-value of 0.00, so we can conclude that the variables are jointly statistically significant.
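As a rough sketch of where the F-statistic comes from, the code below computes it by hand as the explained variance per predictor divided by the unexplained variance per residual degree of freedom. The helper name `f_statistic`, the seed, and the two simulated targets are assumptions for illustration. In statsmodels the same quantity is available as `results.fvalue`, and the p-value comes from an F distribution with (Df Model, Df Residuals) degrees of freedom.

```python
import numpy as np

def f_statistic(X, y):
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    # Explained variance per predictor over
    # unexplained variance per residual degree of freedom.
    return ((ss_tot - ss_res) / k) / (ss_res / (n - k - 1))

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))
noise = rng.normal(0, 1, 100)

f_related = f_statistic(X, X[:, 0] * 2 + X[:, 1] * 3 + noise)  # y depends on X
f_unrelated = f_statistic(X, noise)                            # y ignores X

print(f"F when y depends on X: {f_related:.1f}")
print(f"F when y ignores X:    {f_unrelated:.2f}")
```

A large F-statistic corresponds to a small p-value, so the related case would reject the null hypothesis while the unrelated case typically would not.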

## Are each of the variables important?

The last question we ask is about individual variable importance.  This is done via the t-test, which asks: is the given variable statistically significant with respect to the dependent variable?

So we'll have one t-test per variable.

As you can see above, x1 and x2 are both statistically significant with respect to y because both of their p-values are 0.000.

## Some important caveats about t-tests

The t-statistic for a variable is defined by

`coefficient of variable / standard error of coefficient`

So this means that a t-test can pass for one of a few reasons:

1. the coefficient is very, very small, but its standard error is smaller still
2. the variance of the independent variable is very high, which shrinks the standard error of its coefficient
3. given a reasonable coefficient magnitude, the ratio of the coefficient to its standard error is large enough that the p-value is below 0.05

If we are trying to ensure our tests pass for the right reasons, we want to be in case 3.  If we are in case 1, the t-test might pass misleadingly: the coefficient is so small that the variance of the independent variable may not actually contribute to the variance of the dependent variable.  The `const` row above is an example: its coefficient is about -3.3e-16, which is effectively zero, yet its p-value is 0.029.
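The sketch below computes t-statistics by hand, using the usual OLS formula in which each coefficient is divided by its standard error (the square root of the corresponding diagonal entry of sigma^2 (A'A)^-1). The helper name `t_statistics`, the seed, and the simulated data, in which the second feature has no effect on y, are assumptions for illustration.

```python
import numpy as np

def t_statistics(X, y):
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ beta
    # Unbiased estimate of the noise variance.
    sigma2 = residuals @ residuals / (n - k - 1)
    # Standard errors: square roots of the diagonal of sigma^2 * (A'A)^-1.
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    return beta, std_err, beta / std_err

rng = np.random.default_rng(0)
X = rng.normal(0, 1, (100, 2))
# Only the first feature drives y; the second is irrelevant.
y = 2 * X[:, 0] + rng.normal(0, 1, 100)

beta, std_err, t = t_statistics(X, y)
for name, b, se, tv in zip(["const", "x1", "x2"], beta, std_err, t):
    print(f"{name}: coef={b:.3f}  std err={se:.3f}  t={tv:.2f}")
```

Printing the coefficient and standard error next to each t-statistic makes it easier to check which of the three cases a "passing" test falls into.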


# Feature Engineering

Now that we have a sense of how linear regression works, let's make a new dataset with some features that matter and some features that don't.  Then we'll show how our model tests inform how we decide which independent variables matter for statistical significance.

Then we'll compare this with automated feature engineering techniques.

In [None]:
from skfeature.function.statistical_based import chi_square
from skfeature.function.information_theoretical_based import CIFE
from skfeature.function.statistical_based import CFS
from skfeature.function.information_theoretical_based import CMIM
from skfeature.function.information_theoretical_based import DISR
from skfeature.function.information_theoretical_based import FCBF
from skfeature.function.information_theoretical_based import ICAP
from skfeature.function.information_theoretical_based import JMI
from skfeature.function.information_theoretical_based import MIFS
from skfeature.function.information_theoretical_based import MIM
from skfeature.function.information_theoretical_based import MRMR
from skfeature.function.similarity_based import SPEC
