# Linear Regression 
## Objectives 
1. Understand the differences between variance, covariance, and correlation.
2. Be able to explain each piece of the linear regression formula, including the terminology typically associated with regression.
3. Understand the assumptions of regression.
4. Use the diabetes dataset to fit a simple linear regression model with the _statsmodels_ package.

## What is linear regression? 
> Regression analysis is a **parametric** technique: a set of parameters is used to **predict** the value of an unknown target variable (the dependent variable, often denoted by _y_) based on one or more known input features (the independent variables or predictors, often denoted by _x_).

## Covariance and Correlation

_Correlation_ captures the simple idea that variables often change _together_. For example, cities with more buses tend to have higher populations.

We might observe that, as one variable X increases, so does another Y, OR that as X increases, Y decreases.

The _covariance_ describes how two variables co-vary. Note the similarity in the definition to the definition of ordinary variance:


**Variance**: A measure of dispersion around the mean for a continuous random variable, i.e. how far a set of numbers is spread out from its overall average value. <br/>
$n$ = number of data points <br/>
$x_i$ = individual data points <br/>
$\mu$ = mean <br/>
$$\sigma^2 = \sum_{i=1}^{n}\frac{(x_i -\mu )^2}{n}$$

**Covariance**: A measure of how two variables vary together. Each deviation of $x$ is paired with the deviation of $y$ from the same observation $i$:
$$\sigma_{xy} = \frac{\sum_{i=1}^{n} (x_i -\mu_x )(y_i - \mu_y)}{n}$$
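
As a quick check of the two definitions above, the sketch below computes the population variance and covariance by hand and compares them with NumPy. The arrays `x` and `y` are made-up illustration values, not taken from the diabetes data.

```python
import numpy as np

# Made-up illustration data (not from the diabetes dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(x)
mu_x, mu_y = x.mean(), y.mean()

# Population variance: average squared deviation from the mean
var_x = np.sum((x - mu_x) ** 2) / n

# Population covariance: average product of paired deviations
cov_xy = np.sum((x - mu_x) * (y - mu_y)) / n

# NumPy equivalents; ddof=0 divides by n rather than n - 1
var_x_np = np.var(x, ddof=0)
cov_xy_np = np.cov(x, y, ddof=0)[0, 1]

print(var_x, var_x_np)
print(cov_xy, cov_xy_np)
```

Note that `np.cov` divides by n - 1 by default (the sample covariance), so `ddof=0` is needed to match the formulas as written here.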

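The sign of the covariance tells us the direction of the relationship:
* Positive covariance --> the variables tend to move together
* Negative covariance --> the variables tend to move in opposite directions

The problem: covariance ranges over (-∞, ∞) and depends on the units of the variables, so its magnitude alone cannot tell us which relationship is "stronger."

So, we need: 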

## Correlation
Pearson's correlation normalizes the covariance by the spread of both variables, so relationships are represented on a [-1, 1] scale:

$$ r = \frac{\sum_{i=1}^{n}(x_i -\mu_x)(y_i - \mu_y)} {\sqrt{\sum_{i=1}^{n}(x_i - \mu_x)^2 \sum_{i=1}^{n}(y_i-\mu_y)^2}}$$
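
To see why the normalization matters, here is a minimal sketch on made-up data (not the diabetes dataset). The helper `pearson_r` is a hypothetical name introduced for illustration. Rescaling `x`, as a change of units would, changes the covariance but leaves the correlation untouched.

```python
import numpy as np

# Made-up illustration data (not from the diabetes dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

def pearson_r(a, b):
    """Pearson correlation from paired deviations (hypothetical helper)."""
    da, db = a - a.mean(), b - b.mean()
    return np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))

r = pearson_r(x, y)

# Rescale x, e.g. a change of units
x_scaled = 100 * x
cov_original = np.cov(x, y, ddof=0)[0, 1]
cov_scaled = np.cov(x_scaled, y, ddof=0)[0, 1]
r_scaled = pearson_r(x_scaled, y)

print(cov_original, cov_scaled)  # covariance grows by a factor of 100
print(r, r_scaled)               # correlation is unchanged
```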

<img src='https://raw.githubusercontent.com/learn-co-students/dsc-0-10-03-cov-corr-online-ds-sp-000/master/images/correlation.png' width=70%/>

In [None]:
#load libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
plt.style.use('ggplot')

import seaborn as sns

#load the dataset
df = pd.read_csv('data/diabetes.csv')

#Print the first 5 rows of the dataframe.
df.head()

> *Note:* Examples in this lecture use the Pima Indians Diabetes data, which was taken from a larger collection of data originally collected by the National Institute of Diabetes and Digestive and Kidney Diseases. It provides diagnostic data on 768 females of Pima Indian heritage aged 21 years or older. The analyses presented in this lecture are intended to illustrate an approach to linear regression modeling and should not be used to draw substantive conclusions on biomedical pathologies.

In [None]:
df = df.loc[:, ["Age", "Glucose"]]
df.info()

> The Age column provides each participant's age measured in years. Glucose is a measure of each participant's plasma glucose concentration at 2 hours from an oral glucose tolerance test. In the current analyses we will use age to predict participants' plasma glucose concentration.

## Terminology 


### Basics: 
**Independent Variable:** the data we are using to make a prediction. In this case: _Age_. <br/>
AKA - predictor, feature, inputs. 

**Dependent Variable:** the data we are trying to predict. In this case: _Glucose_. <br/>
AKA - target, outcome or outputs. 

For our first model:

$$ y = m \cdot x + b $$

Here:

- $x$: input column (just one for now)
- $y$: output column (column we're trying to predict)

We solve for the coefficients $m$ (our slope) and $b$ (the y-intercept) by finding the line that 'best' represents the relationship between $x$ and $y$, _assuming_ that relationship is a straight line.
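
For a single predictor, the least-squares slope and intercept have a closed form: $m = \sigma_{xy} / \sigma_x^2$ and $b = \mu_y - m \cdot \mu_x$. The sketch below applies it to made-up data (not the diabetes dataset) and compares the result with `np.polyfit`.

```python
import numpy as np

# Made-up illustration data (not from the diabetes dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Slope: covariance of x and y divided by the variance of x
# (the 1/n factors cancel, so plain sums are enough)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Intercept: forces the line through the point (mean of x, mean of y)
b = y.mean() - m * x.mean()

# Same fit via NumPy: a degree-1 polynomial returns [slope, intercept]
m_np, b_np = np.polyfit(x, y, 1)

print(m, b)
print(m_np, b_np)
```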

**Let's pre-process our data and do a tiny bit of EDA. Remember our CRISP-DM method?** 
![](https://www.datascience-pm.com/wp-content/uploads/2021/02/CRISP-DM.png)


In [None]:
df.describe()

In [None]:
#how many rows have 0 for Glucose? 
(df["Glucose"] == 0).sum()


In [None]:
#let's get rid of zeros using .loc
df = df.loc[df["Glucose"] != 0]


In [None]:
df.describe()

Ok great, now let's look at the correlation between our two variables

In [None]:
df[["Age", "Glucose"]].corr()

The correlation between age and glucose concentration is 0.27, indicating a weak positive relationship: older participants tend, on average, to have higher glucose concentrations than younger participants. We can visualize this with a scatterplot to get a better idea of the observed relationship between age and glucose levels.

In [None]:
plt.figure(figsize=(10, 10))
ax = plt.axes()
ax.scatter(df["Age"], df["Glucose"], color='r', alpha=0.20)
ax.set_xlabel('Age')
ax.set_ylabel('Glucose')
plt.show();

In [None]:
sns.lmplot(x='Age', y='Glucose', data=df)
plt.show()

In [None]:
X = df['Age']
y = df['Glucose']

plt.plot(X, y, 'o')
m, b = np.polyfit(X, y, 1)

plt.plot(X, m*X + b);

## Let's build a model using StatsModels 

In [None]:
import statsmodels.api as sm

To give the intercept an interpretation that is consistent with the observed data, it is a good idea to recode age by subtracting the minimum value of 21 from each individual age value. Otherwise, the intercept would be an extrapolation to age zero, which does not exist in this dataset.
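
A minimal sketch of what this recoding does, using hypothetical ages and glucose values (not the diabetes data): shifting the predictor leaves the slope unchanged and moves the intercept to the predicted value at the minimum age.

```python
import numpy as np

# Hypothetical ages and glucose values, for illustration only
age = np.array([21.0, 25.0, 30.0, 40.0, 50.0])
glucose = np.array([100.0, 105.0, 110.0, 120.0, 130.0])

# Fit on the raw ages: the intercept is the prediction at age 0
m_raw, b_raw = np.polyfit(age, glucose, 1)

# Fit on ages shifted so that the minimum age becomes 0
age_shifted = age - age.min()
m_shifted, b_shifted = np.polyfit(age_shifted, glucose, 1)

print(m_raw, m_shifted)                     # slopes match
print(b_raw)                                # extrapolated prediction at age 0
print(b_shifted, m_raw * age.min() + b_raw) # prediction at the minimum age
```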

In [None]:
df["Age"] = df["Age"] - df["Age"].min()

In [None]:
#instantiate OLS model (note: no intercept column is added here)
model = sm.OLS(df['Glucose'], df['Age'])
results = model.fit()
print(results.summary())

_Note:_ <br/>
`sm.OLS(...)` only contains information on the structure of our model. If you evaluate `model` in a Python shell or Jupyter notebook cell, you will just get output like _<statsmodels.regression.linear_model.OLS at 0x1a27daa990>_. At that point we still need to fit the model to our data, which we do by calling its `fit` method. That is why the cell above stores the fitted model in `results`.

### Initial Interpretations of Summary stats
There is a lot going on here, so I only want to focus on a few pieces of output.
> First, the R-squared has a value of 0.555. If we interpret R² as the proportion of variance in the outcome accounted for by our model, this value suggests that our model, with just one independent variable, accounts for roughly 56% of the variance in glucose levels. One caveat: `sm.OLS` does not add an intercept automatically, and the model above was fit without one. In that case statsmodels reports an _uncentered_ R², which is why this value is so much larger than the squared correlation between the two variables (0.27² ≈ 0.07).

> Second, the value of 5.4991 under the coef column in the Age row provides the regression weight for our predictor, age. We can interpret this value to mean that for every one-year increase in age, the predicted glucose concentration increases by roughly 5.50. Thus, a participant whose age is one year above the minimum (i.e., age 22) is expected to have a glucose concentration that is about 5.50 units higher than a participant who is at the minimum age (i.e., age 21).
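
When reading R² from a model fit without an intercept column, keep in mind that the uncentered version compares the residuals against $\sum y_i^2$ rather than $\sum (y_i - \mu_y)^2$. The sketch below uses simulated data (the baseline, slope, and noise level are arbitrary assumptions, not estimates from the diabetes data) to show how far apart the two versions can be.

```python
import numpy as np

# Simulated data: a large baseline with a weak trend (arbitrary values)
rng = np.random.default_rng(0)
x = rng.uniform(0, 60, size=200)
y = 120 + 0.5 * x + rng.normal(0, 30, size=200)

# Fit through the origin (no intercept): slope = sum(x*y) / sum(x^2)
slope_no_int = np.sum(x * y) / np.sum(x ** 2)
resid_no_int = y - slope_no_int * x
# Uncentered R^2: residuals compared against the raw sum of squares of y
r2_uncentered = 1 - np.sum(resid_no_int ** 2) / np.sum(y ** 2)

# Fit with an intercept
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)
# Centered R^2: residuals compared against the variation around the mean
r2_centered = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

print(slope_no_int, slope)
print(r2_uncentered, r2_centered)
print(np.corrcoef(x, y)[0, 1] ** 2)  # equals the centered R^2
```

With a single predictor and an intercept, the centered R² equals the squared Pearson correlation, which gives a quick sanity check on any reported value.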

## How am I able to draw these conclusions? 


### Let's break down the regression line and the associated error 

<img src='https://rasbt.github.io/mlxtend/user_guide/regressor/LinearRegression_files/simple_regression.png' />

A **residual** is the difference between the actual value and the predicted value for a data point whose true value we know.


$$ \text{Sum of Squared Residuals} = \sum\limits_{i=1}^{n} (y_i - \hat{y}_{i})^{2}$$

where $\hat{y}_i$ is the predicted value for observation $i$.
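
The sum of squared residuals can be computed directly from a fitted line. This is a minimal sketch on made-up data (not the diabetes dataset). It also nudges the slope slightly to illustrate that the least-squares line has the smallest sum of squared residuals among the lines tried.

```python
import numpy as np

# Made-up illustration data (not from the diabetes dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Least-squares line and its predictions
m, b = np.polyfit(x, y, 1)
y_pred = m * x + b

# Residual: actual value minus predicted value, one per data point
residuals = y - y_pred

# Sum of squared residuals for the least-squares line
ssr = np.sum(residuals ** 2)

# A slightly different slope fits worse (larger sum of squared residuals)
ssr_other = np.sum((y - ((m + 0.1) * x + b)) ** 2)

print(residuals)
print(ssr, ssr_other)
```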