# Regression Analysis

Thus far we've seen descriptive statistics, filtering, and confirmatory analysis. All of that comes together in modeling.

In a way we've already been doing statistical modeling, because statistical modeling is merely the sum of the parts we've covered thus far.

Generally speaking, statistics (mathematical measures that describe data) can be broken out into a few categories:

* L-estimators
* M-estimators
* "Advanced" estimators

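So far we've been looking at L-estimators. These are things like the mean, the median, or the interquartile range, along with similarly simple summaries such as the standard deviation (which is not strictly an L-estimator, but plays the same role here).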

Each of these measures has a clear and _simple_ mathematical description with a clear and simple intuition for humans. As an aside, because these estimators are so simple, people often misinterpret their results, either by misunderstanding the underlying data or by not checking that all of the estimator's assumptions are satisfied. This is the so-called "lying with statistics".
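As a small illustration of how these single-column summaries can mislead, the sketch below computes them for a made-up list of fares. The values are hypothetical (they are not taken from the taxi data), and the last one is a deliberate outlier. It uses only Python's standard `statistics` module.

```python
import statistics

# Hypothetical fares in dollars; the last value is a deliberate outlier
fares = [6.5, 7.0, 8.0, 9.5, 10.0, 11.5, 12.0, 250.0]

mean = statistics.mean(fares)      # pulled far upward by the outlier
median = statistics.median(fares)  # barely affected by the outlier
stdev = statistics.stdev(fares)    # sample standard deviation, also inflated

# quantiles(n=4) returns the three quartile cut points
q1, _, q3 = statistics.quantiles(fares, n=4)
iqr = q3 - q1                      # spread of the middle half of the data

print("mean:", mean)
print("median:", median)
print("standard deviation:", stdev)
print("interquartile range:", iqr)
```

Reporting only the mean here would suggest a typical fare several times larger than what most of these riders paid, which is exactly the kind of misreading described above.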

M-estimators, like the one we'll look at here and the ones we'll look at in the next exercise, introduce nothing new except added complexity.

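Our first example of an M-estimator is called Ordinary Least Squares. It gets its name from how the estimator works: it chooses the coefficients that make the sum of the squared differences between the observed values and the model's predictions as small as possible.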
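For reference, here is the usual textbook form of that idea for a model with a single predictor. The notation ($x_i$, $y_i$, $\beta_0$, $\beta_1$, $n$) is introduced here for illustration and is not used elsewhere in this lesson.

```latex
\hat{\beta}_0, \hat{\beta}_1
  = \arg\min_{\beta_0,\, \beta_1} \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_i \right)^2
```

Each term in the sum is the squared gap between an observed value $y_i$ and the value the line predicts for $x_i$.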

With the L-estimators we looked at previously, we only needed to look at a single column. This is in part because we were learning simple patterns, things like the center or spread of a single column.

Now, with M-estimators, we'll be looking at things like the strength of the relationship between two or more related variables. Also, the description our M-estimator produces won't be a single number. Instead it will be an equation that approximates the relationships in the underlying data.

If this equation is a reasonable approximation, then a whole set of facts follows from it, and we can leverage all of the relevant pieces of mathematics to inform our analysis.

For instance, with an equation like:

`Fare_amount = 2*Trip_distance + epsilon`

Here epsilon is some small amount of noise which is normally distributed with mean 0 and standard deviation 1.

If the above equation holds true we can make informed decisions about when to take a cab! But more than that, we know the derivative of this equation, the graph of this function, and many other facts about the relationship between `Fare_amount` and `Trip_distance`. Of course, 2 is just a made-up coefficient that probably isn't accurate.

Using Ordinary Least Squares, we'll be able to figure out what the real coefficient is.  And we can use that information to inform an analysis of taxi cabs across New York City.
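To make this concrete before touching real data, here is a minimal sketch that simulates trips from the made-up equation above and then recovers the coefficient with the closed-form least-squares solution for a single predictor. Everything in it is an assumption for illustration: the distances are random numbers, the variable names are hypothetical, and the helper `ols_slope_and_intercept` is not part of any library.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Simulated trips: distances are made up, fares follow the illustrative equation
trip_distance = [random.uniform(0.5, 20.0) for _ in range(1000)]
fare_amount = [2 * d + random.gauss(0, 1) for d in trip_distance]

def ols_slope_and_intercept(x, y):
    """Closed-form OLS fit of y = slope * x + intercept for one predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel in the ratio)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = ols_slope_and_intercept(trip_distance, fare_amount)
print(f"estimated slope: {slope:.3f}, intercept: {intercept:.3f}")
```

Because the data were generated with a coefficient of 2 and no intercept, the estimated slope should land close to 2 and the intercept close to 0. With real taxi data we don't know the true values, which is the point of fitting the model.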

## A First Example

Now that we've got a basic understanding of the goals of linear regression, let's see how it works in practice:

In [1]:
from scipy import stats
from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel, f_regression
import math

import numpy as np
from scipy.stats import randint as sp_randint
from sklearn import metrics
from prettytable import PrettyTable

# Now that we have multiple variables, some sense of summary will be helpful
def summary(X_vars, y_var, model, categorical=False):
    cols = X_vars.columns.tolist()
    # For a categorical target, report which features an L1-penalized linear SVC keeps
    if categorical:
        lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X_vars, y_var)
        # Use a separate name so the fitted `model` argument is not overwritten
        selector = SelectFromModel(lsvc, prefit=True)
        # Keep every selected index, including column 0
        labels = [cols[i] for i in selector.get_support(indices=True)]
        print("Linear SVC Feature reduction")
        print(cols, "->", labels)
    # Univariate F-test of each feature against the target
    f_test, t_test_pvals = f_regression(X_vars, y_var)
    # With a single regressor, the magnitude of the t-statistic is sqrt(F)
    t_test = [math.sqrt(elem) for elem in f_test]
    # One-way ANOVA across the feature columns
    f_statistic, f_prob = stats.f_oneway(*[np.array(X_vars[column]) for column in X_vars.columns])
    return f_statistic, f_prob, t_test, t_test_pvals, model.coef_
    
        
def pretty_print_results(X, f_statistic, f_prob, t_test, t_test_pvals, mi, coef):
    cols = X.columns.tolist()
    t_test = list(t_test)
    t_test_pvals = list(t_test_pvals)
    mi = list(mi)
    coef = list(coef)
    tables = []
    # Show at most three feature columns per table so wide results stay readable.
    # Slicing past the end of a list is safe, so the last chunk may be shorter.
    for index in range(0, len(cols), 3):
        chunk = slice(index, index + 3)
        tmp_table = PrettyTable(["test_name"] + cols[chunk])
        tmp_table.add_row(["t-test"] + t_test[chunk])
        tmp_table.add_row(["t-test pvals"] + t_test_pvals[chunk])
        tmp_table.add_row(["mutual info"] + mi[chunk])
        tmp_table.add_row(["coefficient"] + coef[chunk])
        tables.append(tmp_table)
    print("F-statistic", f_statistic)
    print("F-prob", f_prob)
    for table in tables:
        print(table)

In [1]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
import numpy as np

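def generate_data():
    # 1000 samples of 10 standard-normal features
    X = np.random.randn(1000, 10)
    # y has the same shape as X: ten targets, each exactly twice the matching feature
    y = X*2
    return X, y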

linear_regression = LinearRegression()
X, y = generate_data()

linear_regression.fit(X, y)


LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
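One way to check what a fit like this recovered is to inspect the fitted coefficients and the R² score. The sketch below is a self-contained variant of the cell above rather than a continuation of it: it regenerates the data with a seeded NumPy generator (an assumption added here for reproducibility) and uses the `r2_score` function that was imported earlier.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)  # seeded generator, added here for reproducibility
X = rng.standard_normal((1000, 10))
y = X * 2  # ten targets, each exactly twice the matching feature

model = LinearRegression().fit(X, y)

# With several targets, coef_ has shape (n_targets, n_features).
# Since target j depends only on feature j, it should be close to 2 * identity.
print(np.round(model.coef_, 3))
print("intercepts near zero:", np.allclose(model.intercept_, 0.0, atol=1e-8))

# There is no noise in y, so the fit should be essentially perfect
print("R^2:", r2_score(y, model.predict(X)))
```

Because this synthetic `y` contains no noise, the regression has nothing to average away. Adding a noise term, as in the fare equation earlier, would make the recovered coefficients approximate rather than exact.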

In [5]:
np.random.randn(1000, 10)[0]

array([ 0.51201876,  0.47426296, -0.10556103, -1.17890058,  0.77113891,
       -0.42845718,  0.37776175,  0.49112643, -0.41052297, -0.11299925])

In [None]:
from skfeature.function.statistical_based import chi_square
from skfeature.function.information_theoretical_based import CIFE
from skfeature.function.statistical_based import CFS
from skfeature.function.information_theoretical_based import CMIM
from skfeature.function.information_theoretical_based import DISR
from skfeature.function.information_theoretical_based import FCBF
from skfeature.function.information_theoretical_based import ICAP
from skfeature.function.information_theoretical_based import JMI
from skfeature.function.information_theoretical_based import MIFS
from skfeature.function.information_theoretical_based import MIM
from skfeature.function.information_theoretical_based import MRMR
from skfeature.function.similarity_based import SPEC
