## Starting off

Write down a definition for each of the following two terms:

Bias - 

Variance - 

# Model Evaluation

Agenda:
- R^2 
- Bias versus Variance
- Train Test Split
- Model Evaluation


##  Coefficient of Determination ($R^2$)

The _coefficient of determination_ is a measure of how well the model fits the data.

It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

$R^2$ for a model is ultimately a _relational_ notion. It's a measure of goodness of fit _relative_ to a (bad) baseline model. This bad baseline model is simply the horizontal line $y = \mu_Y$, for dependent variable $Y$.


$$\text{TSS }= \text{ESS} + \text{RSS }$$

- TSS or SST = Total Sum of Squares 
- ESS or SSE = Explained Sum of Squares
- RSS or SSR = Residual Sum of Squares

The actual calculation of $R^2$ is:

$$\Large R^2= \frac{\sum_i(\hat{y}_i - \bar{y})^2}{\sum_i(y_i - \bar{y})^2}=1- \frac{\sum_i(y_i - \hat{y}_i)^2}{\sum_i(y_i - \bar{y})^2}$$

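$R^2$ takes values between 0 and 1 for a least-squares model with an intercept, evaluated on the data it was fit to. On new data, or for a model without an intercept, it can be negative, which means the model does worse than the baseline.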

$R^2$ is a measure of how much variation in the dependent variable your model explains.
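The decomposition and both forms of the $R^2$ formula can be checked numerically. The sketch below uses synthetic data (an assumption made purely for illustration, not the cars data used later) and fits a least-squares line with NumPy.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Synthetic data (illustrative assumption): a noisy linear relationship
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(0, 2.0, size=100)

# Ordinary least-squares line with an intercept
line = Polynomial.fit(x, y, deg=1)
y_hat = line(x)
y_bar = y.mean()  # the "bad baseline" model predicts y_bar everywhere

tss = np.sum((y - y_bar) ** 2)      # total sum of squares
ess = np.sum((y_hat - y_bar) ** 2)  # explained sum of squares
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares

r2_explained = ess / tss
r2_residual = 1 - rss / tss

print(f"TSS = {tss:.2f}, ESS + RSS = {ess + rss:.2f}")
print(f"R^2 = {r2_explained:.4f} (ESS/TSS), {r2_residual:.4f} (1 - RSS/TSS)")
```

The two printed $R^2$ values agree because the model is a least-squares fit with an intercept; without those conditions the identity TSS = ESS + RSS need not hold.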


<img src='https://pbs.twimg.com/media/D-Gu7E0WsAANhLY.png' width ="700">

In [1]:
# build a simple linear regression in python
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
# seaborn sets the plot theme; the 'seaborn' matplotlib style name is deprecated since matplotlib 3.6
sns.set(style="white")

In [2]:
#read in car data
df = sns.load_dataset('mpg')

###  Applying R^2

Let's look at the $R^2$ value of our simple linear regression (SLR) model for the cars data.

In [3]:
# building a linear regression model using statsmodel 
from statsmodels.formula.api import ols

lr_model = ols(formula='mpg~weight', data=df).fit()

lr_model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.692 |
| Model: | OLS | Adj. R-squared: | 0.691 |
| Method: | Least Squares | F-statistic: | 888.9 |
| Date: | Tue, 28 Apr 2020 | Prob (F-statistic): | 2.97e-103 |
| Time: | 16:24:13 | Log-Likelihood: | -1148.4 |
| No. Observations: | 398 | AIC: | 2301.0 |
| Df Residuals: | 396 | BIC: | 2309.0 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 46.3174 | 0.795 | 58.243 | 0.000 | 44.754 | 47.881 |
| weight | -0.0077 | 0.000 | -29.814 | 0.000 | -0.008 | -0.007 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 40.423 | Durbin-Watson: | 0.797 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 56.695 |
| Skew: | 0.713 | Prob(JB): | 4.89e-13 |
| Kurtosis: | 4.176 | Cond. No. | 11300.0 |


What happens to the $R^2$ when we add more variables?

In [4]:
mlr_model = ols(formula='mpg~weight+horsepower+displacement+cylinders+acceleration', data=df).fit()
mlr_model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.708 |
| Model: | OLS | Adj. R-squared: | 0.704 |
| Method: | Least Squares | F-statistic: | 186.9 |
| Date: | Tue, 28 Apr 2020 | Prob (F-statistic): | 9.82e-101 |
| Time: | 16:24:13 | Log-Likelihood: | -1120.1 |
| No. Observations: | 392 | AIC: | 2252.0 |
| Df Residuals: | 386 | BIC: | 2276.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 46.2643 | 2.669 | 17.331 | 0.000 | 41.016 | 51.513 |
| weight | -0.0052 | 0.001 | -6.351 | 0.000 | -0.007 | -0.004 |
| horsepower | -0.0453 | 0.017 | -2.716 | 0.007 | -0.078 | -0.012 |
| displacement | -8.313e-05 | 0.009 | -0.009 | 0.993 | -0.018 | 0.018 |
| cylinders | -0.3979 | 0.411 | -0.969 | 0.333 | -1.205 | 0.409 |
| acceleration | -0.0291 | 0.126 | -0.231 | 0.817 | -0.276 | 0.218 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 38.561 | Durbin-Watson: | 0.865 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 52.737 |
| Skew: | 0.706 | Prob(JB): | 3.53e-12 |
| Kurtosis: | 4.111 | Cond. No. | 38700.0 |


Compare it to a model without displacement.

In [5]:
mlr_model = ols(formula='mpg~weight+horsepower+cylinders+acceleration', data=df).fit()
mlr_model.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | mpg | R-squared: | 0.708 |
| Model: | OLS | Adj. R-squared: | 0.705 |
| Method: | Least Squares | F-statistic: | 234.2 |
| Date: | Tue, 28 Apr 2020 | Prob (F-statistic): | 6.02e-102 |
| Time: | 16:24:13 | Log-Likelihood: | -1120.1 |
| No. Observations: | 392 | AIC: | 2250.0 |
| Df Residuals: | 387 | BIC: | 2270.0 |
| Df Model: | 4 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 46.2740 | 2.448 | 18.902 | 0.000 | 41.461 | 51.087 |
| weight | -0.0052 | 0.001 | -7.070 | 0.000 | -0.007 | -0.004 |
| horsepower | -0.0453 | 0.016 | -2.820 | 0.005 | -0.077 | -0.014 |
| cylinders | -0.4005 | 0.303 | -1.321 | 0.187 | -0.997 | 0.196 |
| acceleration | -0.0290 | 0.125 | -0.232 | 0.817 | -0.275 | 0.217 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 38.54 | Durbin-Watson: | 0.865 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 52.705 |
| Skew: | 0.706 | Prob(JB): | 3.59e-12 |
| Kurtosis: | 4.111 | Cond. No. | 35500.0 |


## What Is the Adjusted R-squared?

The adjusted R-squared compares the explanatory power of regression models that contain different numbers of predictors.

Suppose you compare a five-predictor model with a higher R-squared to a one-predictor model. Does the five-predictor model have a higher R-squared because it’s better? Or is the R-squared higher because it has more predictors? Simply compare the adjusted R-squared values to find out!

$$\text{Adjusted } R^2=1-\left(\frac{n-1}{n-p-1}\right)(1-R^2)$$

Where:

n = sample size   

p  = the number of independent variables in the regression equation (not counting the intercept)


- The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. 

- The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. 

- It is never higher than the R-squared, and it is strictly lower whenever the model has at least one predictor and $R^2 < 1$.
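As a quick check, the adjusted R-squared values in the three regression summaries above can be reproduced from the reported $R^2$, the number of observations, and the number of predictors. This is a minimal sketch: the helper name `adjusted_r2` is made up for illustration, and it uses $n - p - 1$ in the denominator, with $p$ counting predictors but not the intercept.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors (intercept not counted)."""
    return 1 - (n - 1) / (n - p - 1) * (1 - r2)

# (R^2, n, p) as reported in the three summaries above
models = {
    "mpg ~ weight": (0.692, 398, 1),
    "five predictors": (0.708, 392, 5),
    "four predictors (no displacement)": (0.708, 392, 4),
}

for name, (r2, n, p) in models.items():
    print(f"{name}: adjusted R^2 = {adjusted_r2(r2, n, p):.3f}")
# prints 0.691, 0.704 and 0.705, matching the summaries
```

Dropping `displacement` leaves $R^2$ at 0.708 but raises the adjusted value from 0.704 to 0.705, which is the adjustment rewarding the simpler model.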

## Probabilistic Model Selection
Probabilistic model selection (or “information criteria”) provides an analytical technique for scoring and choosing among candidate models.

Models are scored both on their performance on the training dataset and based on the complexity of the model.

- **Model Performance:** How well a candidate model has performed on the training dataset.
- **Model Complexity:** How complicated the trained candidate model is after training.

Model performance may be evaluated using a probabilistic framework, such as log-likelihood under the framework of maximum likelihood estimation. Model complexity may be evaluated as the number of degrees of freedom or parameters in the model.

### Akaike Information Criterion vs. Bayesian Information Criterion

The model with the lower AIC or BIC should be selected. 

Despite various subtle theoretical differences, their only difference in practice is the size of the penalty; BIC penalizes model complexity more heavily.

Compared to the BIC method, the AIC statistic penalizes complex models less, meaning that it may put more emphasis on model performance on the training dataset, and, in turn, select more complex models.

A downside of BIC is that for smaller, less representative training datasets, it is more likely to choose models that are too simple.
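To make the penalty difference concrete, the standard definitions are $\mathrm{AIC} = 2k - 2\ln\hat{L}$ and $\mathrm{BIC} = k\ln(n) - 2\ln\hat{L}$, where $\hat{L}$ is the maximized likelihood, $k$ the number of estimated parameters, and $n$ the number of observations. The sketch below plugs in the log-likelihoods from the regression summaries earlier in this notebook, assuming $k$ counts the intercept plus one coefficient per predictor; the function names are illustrative.

```python
import math

def aic(log_likelihood, k):
    """Akaike information criterion: penalty of 2 per parameter."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian information criterion: penalty of ln(n) per parameter."""
    return k * math.log(n) - 2 * log_likelihood

# (log-likelihood, k, n) taken from the summaries above
models = {
    "mpg ~ weight": (-1148.4, 2, 398),
    "five predictors": (-1120.1, 6, 392),
    "four predictors (no displacement)": (-1120.1, 5, 392),
}

for name, (llf, k, n) in models.items():
    print(f"{name}: AIC = {aic(llf, k):.0f}, BIC = {bic(llf, k, n):.0f}")
```

With roughly 400 observations, $\ln(n)$ is about 6, so each extra parameter costs about three times as much under BIC as under AIC. Both criteria prefer the four-predictor model over the five-predictor one here.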

https://machinelearningmastery.com/probabilistic-model-selection-measures/

https://www.methodology.psu.edu/resources/AIC-vs-BIC/

## The Machine Learning Process

1. Look at the big picture. 
2. Get the data. 
3. Discover and visualize the data to gain insights. 
4. Prepare the data for Machine Learning algorithms. 
5. Select a model and train it. 
6. Fine-tune your model. 
7. Present your solution. 
8. Launch, monitor, and maintain your system.


<img src='https://www.kdnuggets.com/wp-content/uploads/crisp-dm-4-problems-fig1.png' width ="400">

**A proper machine learning workflow includes:**

* Separate training and test sets
* Trying appropriate algorithms (No Free Lunch)
* Fitting model parameters
* Tuning impactful hyperparameters
* Proper performance metrics
* Systematic cross-validation

# Prediction Evaluation

## Bias - Variance 

There are 3 types of prediction error: bias, variance, and irreducible error.


**Total Error = Bias^2 + Variance + Irreducible Error**

### The Bias-Variance Tradeoff


**Let's do a thought experiment:**

1. Imagine you've collected 5 different training sets for the same problem.
2. Now imagine using one algorithm to train 5 models, one for each of your training sets.
3. Bias vs. variance refers to the accuracy vs. consistency of the models trained by your algorithm.

<img src='Bias-vs.-Variance-v5-2-darts.png' width=500 />

**High bias** algorithms tend to be less complex, with simple or rigid underlying structure.

+ They train models that are consistent, but inaccurate on average.
+ These include linear or parametric algorithms such as linear regression and naive Bayes.

On the other hand, **high variance** algorithms tend to be more complex, with flexible underlying structure.

+ They train models that are accurate on average, but inconsistent.
+ These include non-linear or non-parametric algorithms such as decision trees and nearest neighbors.
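The thought experiment can be simulated. The sketch below is an illustration under stated assumptions, not part of the original lesson: the true function is a sine wave, each "training set" reuses the same inputs with fresh noise, a straight line stands in for a rigid algorithm, and a degree-9 polynomial stands in for a flexible one.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 30)        # fixed inputs shared by every training set
f_true = np.sin(2 * np.pi * x)   # assumed true relationship
noise_sd = 0.3
n_sets = 200                     # many training sets instead of 5, for stable estimates

def bias_variance(degree):
    """Estimate squared bias and variance of a polynomial fit of the given degree."""
    preds = np.empty((n_sets, x.size))
    for i in range(n_sets):
        y = f_true + rng.normal(0, noise_sd, size=x.size)  # a new noisy training set
        preds[i] = Polynomial.fit(x, y, deg=degree)(x)
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)  # accuracy of the average model
    variance = np.mean(preds.var(axis=0))                  # consistency across models
    return bias_sq, variance

bias_sq_rigid, var_rigid = bias_variance(degree=1)
bias_sq_flexible, var_flexible = bias_variance(degree=9)

print(f"degree 1: bias^2 = {bias_sq_rigid:.4f}, variance = {var_rigid:.4f}")
print(f"degree 9: bias^2 = {bias_sq_flexible:.4f}, variance = {var_flexible:.4f}")
```

The straight line is consistent from one training set to the next but is wrong on average, while the degree-9 fit is close to the truth on average but moves more between training sets.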

### Bias-Variance Tradeoff

This tradeoff in complexity is why there's a tradeoff in bias and variance - an algorithm cannot simultaneously be more complex and less complex.

**Total Error = Bias^2 + Variance + Irreducible Error**


<img src='Bias-vs.-Variance-v4-chart.png' width=500 />

### Error from Bias

**Bias** is the difference between your model's expected predictions and the true values.

<img src='noisy-sine-linear.png' width=500 />

### Error from Variance

**Variance** refers to your algorithm's sensitivity to specific sets of training data.



<img src='noisy-sine-decision-tree.png' width=500/>

Which one is overfit and which one is underfit?

We want to find the proper balance between variance and bias.

<img src='noisy-sine-third-order-polynomial.png' width=500 />


# Train Test Split

**How do we know if our model is overfitting or underfitting?**



If our model is not performing well on the training data, we are probably underfitting.


To know whether our model is overfitting, we need to test it on unseen data and measure its performance there.

If the model performs much worse on the unseen data than on the training data, it is probably overfitting.

The previous module introduced the idea of dividing your data set into two subsets:

* **training set**: a subset to train a model.
* **test set**: a subset to test the trained model.

You could imagine slicing the single data set as follows:

<img src='testtrainsplit.png' width =550 />

**Never train on test data.** If you are seeing surprisingly good results on your evaluation metrics, it might be a sign that you are accidentally training on the test set. 
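Comparing training and test error is how underfitting and overfitting show up in practice. The following sketch uses synthetic noisy sine data and polynomial fits of increasing degree; the data, the 80/20 split, and the chosen degrees are all assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, size=60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, size=60)

# Shuffle the indices, then hold out 20% of the rows as a test set
idx = rng.permutation(x.size)
test_idx, train_idx = idx[:12], idx[12:]
x_train, y_train = x[train_idx], y[train_idx]
x_test, y_test = x[test_idx], y[test_idx]

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (1, 3, 15):
    model = Polynomial.fit(x_train, y_train, deg=degree)  # fit on training data only
    results[degree] = (rmse(y_train, model(x_train)), rmse(y_test, model(x_test)))
    print(f"degree {degree:>2}: train RMSE = {results[degree][0]:.3f}, "
          f"test RMSE = {results[degree][1]:.3f}")
```

Training error can only go down as the polynomial degree grows, so it cannot reveal overfitting on its own. A high training error points to underfitting, and a test error well above the training error points to overfitting.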



<img src='https://developers.google.com/machine-learning/crash-course/images/WorkflowWithTestSet.svg' width=500/>

## Model Evaluation Metrics for Regression

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:



![alt text](mae.png)

**Mean Squared Error** (MSE) is the mean of the squared errors:

![alt text](mse.png)

**Root Mean Squared Error (RMSE)** is the square root of the mean of the squared errors:



![alt text](rmse.png)

MSE is often preferred over MAE because squaring "punishes" larger errors more heavily.

RMSE is more popular still because, unlike MSE, it is expressed in the same units as $y$.

Additionally, I like to divide the RMSE by the standard deviation of the target to convert it to something similar to a Z-score.
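The three metrics are easy to compute by hand. The values below are made up for illustration, and the sample standard deviation (`ddof=1`) is used so the scaled RMSE matches what pandas' `.std()` would give.

```python
import numpy as np

# Toy values (illustrative assumption)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_true - y_pred             # [1, 0, -2, -1]

mae = np.mean(np.abs(errors))        # mean absolute error
mse = np.mean(errors ** 2)           # mean squared error
rmse = np.sqrt(mse)                  # back in the units of y

# RMSE relative to the spread of the target
rmse_scaled = rmse / y_true.std(ddof=1)

print("MAE: ", mae)                  # 1.0
print("MSE: ", mse)                  # 1.5
print("RMSE:", round(rmse, 4))       # 1.2247
print("RMSE / std:", round(rmse_scaled, 4))
```

The single error of size 2 contributes half of the MAE total but two thirds of the MSE total, which is the sense in which MSE punishes larger errors.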

# Practicum

In [6]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

# render plots inline; the bare %matplotlib magic selected a GUI backend that raised an ImportError on macOS
%matplotlib inline

### Read in cleaned movie data set


In [None]:
movie_df = pd.read_csv('cleaned_movie_data.csv', index_col=0)

### Take a look at the data

In [None]:
movie_df.head()

In [None]:
movie_df.describe()

## Feature exploration

In this section I will investigate different features by plotting them to determine their relationship to the target, `gross`.

In [None]:
# count and share of missing values per column in the movie data
total = movie_df.isnull().sum().sort_values(ascending=False)
percent = (movie_df.isnull().sum()/movie_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)

### Identify my features and target variable

In [None]:
target = movie_df.gross

features = movie_df[['____', '____', '____', '____', '____', '____']]

In [None]:
features

In [None]:
features.columns

### Create Train and Test Split

The random state variable makes it so you can always have the same 'random' split

In [None]:
#import train_test_split from sklearn package
from sklearn.model_selection import train_test_split

#call train_test_split on the data and capture the results
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=34,test_size=0.2)

#check the shape of the results
print("Training set - Features: ", X_train.shape, "Target: ", y_train.shape)
print("Test set - Features: ", X_test.shape, "Target: ", y_test.shape)
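The effect of `random_state` can be checked on a toy array: calling `train_test_split` twice with the same seed returns the same rows. The array and variable names below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative assumption): 10 rows, one feature
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2

# Two calls with the same random_state produce the same split
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=34)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=34)

print("Test rows, first call: ", X_test_a.ravel())
print("Test rows, second call:", X_test_b.ravel())
print("Same split:", np.array_equal(X_test_a, X_test_b))
```

Fixing the seed keeps results reproducible between runs, which makes it possible to compare models on exactly the same test rows.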


In [None]:
# fit a model
from sklearn import linear_model

#instantiate a linear regression object
lm = linear_model.LinearRegression()

#fit the linear regression to the data
lm = lm.fit(X_train, y_train)


print(lm.intercept_)
print(lm.coef_)

In [None]:
lm.coef_

### How well did our model perform

Previously we looked at the $R^2$ of the model to determine how good a model it is.

In [None]:
print ("R^2 Score:", lm.score(X_train, y_train))


In [None]:
y_train_pred = lm.predict(X_train)

In [None]:
#import the metrics module from sklearn
from sklearn import metrics

train_mae = metrics.mean_absolute_error(y_train, y_train_pred)
train_mse = metrics.mean_squared_error(y_train, y_train_pred)
train_rmse = np.sqrt(metrics.mean_squared_error(y_train, y_train_pred))


print('Mean Absolute Error:', train_mae )
print('Mean Squared Error:',  train_mse)
print('Root Mean Squared Error:' , train_rmse)

In [None]:
# standard deviation of the target (gross), used to put the errors on a unit-free scale
price_std = target.std()

print('Mean Absolute Error Z:', train_mae/price_std)
print('Root Mean Squared Error Z:', train_rmse/price_std)

### Predicting the Test Set

In [None]:
y_pred = lm.predict(X_test)

In [None]:
# predicted vs. true values on the test set
plt.scatter(y_test, y_pred)
plt.xlabel("True Values")
plt.ylabel("Predictions")

In [None]:
# recent seaborn versions require x and y to be passed as keyword arguments
sns.residplot(x=y_pred, y=y_test, lowess=True, color="g")

In [None]:
print("R^2 Score:", lm.score(X_test, y_test))


In [None]:
test_mae = metrics.mean_absolute_error(y_test, y_pred)
test_mse = metrics.mean_squared_error(y_test, y_pred)
test_rmse = np.sqrt(test_mse)


print('Mean Absolute Error:', test_mae)
print('Mean Squared Error:', test_mse)
print('Root Mean Squared Error:', test_rmse)

In [None]:
print('Mean Absolute Error  Z:', test_mae/price_std )
print('Root Mean Squared Error Z:' , test_rmse/price_std)

### Comparing our Model's performance on training data versus test data.

In [None]:
print('Training: ', int(train_rmse), "vs. Testing: ", int(test_rmse))