# Lesson I

## Limits of Simple Regression

In this chapter we'll go further into regression, including multiple regression and logistic regression. But first, let's understand the limits of simple regression.

In a previous exercise, we made a scatter plot of vegetable consumption as a function of income, and plotted a line of best fit.

Here's what it looks like:

<img src='pictures/incomevegetables.jpg' />

The slope of the line is **0.07**, which means that the difference between the lowest and highest income brackets is about **0.49** servings per day.

It was an arbitrary choice to plot vegetables as a function of income. We could've plotted it the other way around, like this:

<img src='pictures/vegetablesincome.jpg' />

The slope of this line is **0.23**, which means that the difference between 0 and 8 servings per day is about 2 income codes, roughly from 5 to 7. According to the codebook, income code 5 is about $30,000 per year and code 7 is about $65,000.

So if we use vegetable consumption to predict income, we see a big difference. But when we used income to predict vegetable consumption, we saw a small difference.
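The two predicted differences come from simple arithmetic, sketched here. The snippet uses the rounded slopes quoted above and assumes the income codes span 1 to 8, which is what a difference of 0.49 at a slope of 0.07 implies. The variable names are illustrative.

```python
# Slopes quoted in the text (rounded)
slope_veg_on_income = 0.07   # servings per day, per income code
slope_income_on_veg = 0.23   # income codes, per serving per day

# Assumed ranges: income codes 1 through 8, servings 0 through 8
income_range = 8 - 1
servings_range = 8 - 0

# Predicted difference = slope * range of the predictor
diff_servings = slope_veg_on_income * income_range
diff_income_codes = slope_income_on_veg * servings_range

print(f"{diff_servings:.2f} servings per day")
print(f"{diff_income_codes:.2f} income codes")
```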

This example shows that ***regression is not symmetric***; the regression of A onto B is not the same as the regression of B onto A.
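The asymmetry is easy to reproduce on made-up data. The sketch below uses synthetic numbers, not the BRFSS data, and runs `linregress` in both directions. If regression were symmetric, the second slope would be the reciprocal of the first. Instead, for the same set of points, the product of the two slopes equals the squared correlation.

```python
import numpy as np
from scipy.stats import linregress

# Synthetic data, for illustration only (not the BRFSS data)
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

res_y_on_x = linregress(x, y)   # predict y from x
res_x_on_y = linregress(y, x)   # predict x from y

print(res_y_on_x.slope, res_x_on_y.slope)

# The product of the two slopes matches r squared,
# so the slopes are reciprocals only when the correlation is perfect.
product = res_y_on_x.slope * res_x_on_y.slope
r_squared = res_y_on_x.rvalue ** 2
print(product, r_squared)
```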

We can see that more clearly by putting two figures side by side and plotting both regression lines on both figures:

<img src='pictures/notsymmetric.jpg' />

* On the left, we treat income as a known quantity and vegetable consumption as random.
* On the right, vegetable consumption is known and income is random.

This example also demonstrates another point, which is that ***regression doesn't tell you much about causation.***

If you think people with lower income can't afford vegetables, you might look at the figure on the left and conclude that it doesn't make much difference.

If you think better diet increases income, the figure on the right might make you think it does.

But in general, regression can't tell you what causes what.

However, we have tools for teasing apart relationships among multiple variables; one of the most important is ***multiple regression***. ``SciPy`` *doesn't* do multiple regression, so we have to use a different library, ``StatsModels``.

Here's the import statement and how to use it:

```python
import statsmodels.formula.api as smf

# brfss is the BRFSS DataFrame (loaded in the exercise below)
results = smf.ols('INCOME2 ~ _VEGESU1', data=brfss).fit()
results.params

# Output:
# Intercept   5.399903
# _VEGESU1    0.232515
# dtype: float64
```

``ols`` stands for **"ordinary least squares"**, another name for regression.

* First argument: a formula string that specifies that we want to regress income as a function of vegetable consumption.
* Second argument: the BRFSS DataFrame.

The result from ``ols()`` represents the model; we have to run ``.fit()`` to get the results.

The results object contains a lot of information, but the first thing we'll look at is ``.params``, which contains the estimated slope and intercept.
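As a minimal sketch of reading those values, the example below fits a model on synthetic data (the DataFrame `df` and its columns `x` and `y` are made up for illustration). ``.params`` is a pandas Series, so the intercept and slope can be looked up by term name.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data, for illustration only: y is roughly 2 + 3x
rng = np.random.default_rng(1)
df = pd.DataFrame({'x': rng.normal(size=500)})
df['y'] = 2.0 + 3.0 * df['x'] + rng.normal(scale=0.1, size=500)

results = smf.ols('y ~ x', data=df).fit()

# params is a pandas Series indexed by term name
intercept = results.params['Intercept']
slope = results.params['x']
print(intercept, slope)
```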

## Exercise

### Using StatsModels

Let's run the same regression using ``SciPy`` and ``StatsModels`` and confirm we get the same results.

```python
# Import packages
import pandas as pd
import numpy as np
from scipy.stats import linregress
import statsmodels.formula.api as smf

# BRFSS DataFrame
brfss = pd.read_hdf('datasets/brfss.hdf5', 'brfss')
```

```python
# Run regression with linregress
subset = brfss.dropna(subset=['INCOME2', '_VEGESU1'])
xs = subset['INCOME2']
ys = subset['_VEGESU1']
res = linregress(xs, ys)
print(res)

# Run regression with StatsModels
results = smf.ols('_VEGESU1 ~ INCOME2', data=brfss).fit()
print(results.params)
```

```
LinregressResult(slope=0.06988048092105006, intercept=1.5287786243363113, rvalue=0.11967005884864092, pvalue=1.378503916249654e-238, stderr=0.0021109763563323305, intercept_stderr=0.013196467544093591)
Intercept    1.528779
INCOME2      0.069880
dtype: float64
```
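The agreement can also be checked in code rather than by eye. This is a sketch on a synthetic stand-in for the BRFSS columns (the names `income` and `veg` are made up), with some values deliberately set to missing. It assumes the StatsModels formula interface drops rows with missing values by default, which is consistent with the exercise passing the full DataFrame to ``ols`` and still matching ``linregress``.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress
import statsmodels.formula.api as smf

# Synthetic stand-in for the BRFSS columns, with some missing values
rng = np.random.default_rng(2)
df = pd.DataFrame({'income': rng.integers(1, 9, size=1000).astype(float)})
df['veg'] = 1.5 + 0.07 * df['income'] + rng.normal(size=1000)
df.loc[::10, 'veg'] = np.nan   # knock out every tenth value

# Remove rows with missing values before calling linregress
subset = df.dropna(subset=['income', 'veg'])
res = linregress(subset['income'], subset['veg'])

# The formula interface is given the full DataFrame
results = smf.ols('veg ~ income', data=df).fit()

# Compare the two estimates numerically
print(np.isclose(res.slope, results.params['income']))
print(np.isclose(res.intercept, results.params['Intercept']))
```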
