In [2]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as MAE, mean_absolute_percentage_error as MAPE, mean_squared_error as MSE 

import sys
sys.path.append('..')
from src.data import load_data
from src.metric import regressionSummary,adjusted_r2_score
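# Use the fully qualified option name; the bare 'precision' pattern is ambiguous in newer pandas releases
pd.set_option('display.precision', 4)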


# Prediction and Estimation
Estimation is about finding a model that explains the existing data as well as possible, whereas a predictive model is judged by how well it determines unknown target values from the independent variables.

## Evaluating Predictive Performance
Predictive or estimation models with continuous targets are evaluated differently from classifiers, whose performance is quantified with error rates.  A prediction is not simply correct or incorrect; instead, we measure how close it was to the actual value.  Common measures of performance for continuous-target predictions include:

* Mean Error - the average difference between the predicted and expected target
  * Valuable if the direction of the prediction is important (too high or too low) though it can be misleading since too-high and too-low can cancel each other out.
* MAE - Mean absolute error - the average difference between the predicted and expected target, ignoring sign (+/-)
  * takes the too-high or too-low issue out of play, so errors cannot cancel each other out
* MPE - Mean percentage error - average percentage difference between predicted and expected target, keeping the sign of each error
* MAPE - Mean absolute percentage error - average percentage difference between predicted and expected target, using absolute values rather than signed errors
* MSE - Mean square error - average of the square of the error of each prediction
  * squaring penalizes large errors (outliers) heavily, and the result is in squared units of the target, so it is mainly useful for comparing models against each other
* RMSE - Root mean square error - square root of the average of the square of the error of each prediction
  * still significantly penalizes bigger errors, but is expressed in the same units as the target
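
The measures listed above can be sketched with plain NumPy. This is a minimal illustration, not the `regressionSummary` helper imported earlier: the function name `regression_metrics`, the toy values, and the sign convention (error = predicted - actual, so a positive error means the prediction was too high) are all assumptions made for the example. Percentages are reported on a 0-100 scale, and the percentage metrics assume no actual value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return common error metrics for continuous targets (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    # Assumed sign convention: positive error = prediction too high
    error = y_pred - y_true
    # Percentage error relative to the actual value (requires y_true != 0)
    pct_error = error / y_true * 100
    mse = np.mean(error ** 2)

    return {
        "ME": np.mean(error),               # signed errors can cancel out
        "MAE": np.mean(np.abs(error)),      # magnitude only
        "MPE": np.mean(pct_error),          # signed, in percent
        "MAPE": np.mean(np.abs(pct_error)), # magnitude only, in percent
        "MSE": mse,                         # squared units of the target
        "RMSE": np.sqrt(mse),               # same units as the target
    }

# Toy data: one prediction too high, one too low, one exact
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:>4}: {value:.4f}")
```

In this toy data the +10 and -10 errors cancel, so the mean error is zero even though the MAE is not, which is the misleading behavior noted in the list. Note also that scikit-learn's `mean_absolute_percentage_error` returns a fraction rather than a percentage, so its value would differ from the MAPE here by a factor of 100.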


## Multiple Linear Regression
Linear regression models are a good choice for explaining a linear relationship between predictors and a target variable (assuming a linear relationship exists).  They are easy to explain, and the fitted coefficients give a built-in indication of the importance of each predictor.  We can also use this information to include just enough of the variables to describe the target accurately, which may reduce the complexity of the model.

With linear regression we are attempting to describe a target variable in terms of a set of coefficients in a linear equation such as:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \epsilon$$
where the $\beta$ terms are the coefficients, the $X$ terms are the predictive factors (independent variables), and $\epsilon$ is the _noise_, or _unexplainable_ part.  Data is used to estimate the coefficients and to quantify the noise.
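
To make the idea of estimating the coefficients and quantifying the noise concrete, here is a minimal sketch using ordinary least squares on synthetic data. Everything in it is an assumption for illustration: the data is generated rather than loaded, the "true" coefficients and noise level are made up, and `np.linalg.lstsq` stands in for whatever regression library the rest of the analysis uses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Two synthetic predictors X_1 and X_2 (hypothetical data)
X = rng.normal(size=(n, 2))

# Assumed "true" coefficients: beta_0 (intercept), beta_1, beta_2
true_beta = np.array([2.0, 1.5, -0.5])

# Noise term epsilon with an assumed standard deviation of 0.1
epsilon = rng.normal(scale=0.1, size=n)
y = true_beta[0] + X @ true_beta[1:] + epsilon

# Prepend a column of ones so the intercept beta_0 is estimated too
X_design = np.column_stack([np.ones(n), X])

# Ordinary least squares estimate of the coefficients
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Residuals approximate epsilon; their spread quantifies the noise
residuals = y - X_design @ beta_hat
dof = n - X_design.shape[1]  # observations minus estimated coefficients
noise_sd = np.sqrt(residuals @ residuals / dof)

print("estimated coefficients:", np.round(beta_hat, 3))
print("estimated noise std dev:", round(float(noise_sd), 3))
```

With enough observations and little noise, the estimated coefficients should land close to the values used to generate the data, and the residual standard deviation should be close to the noise level that was injected.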