## Exercises

Create a new notebook or Python script named evaluate.

### Evaluating simple linear regressions on lemonade data with other features

1. Create a dataframe from the csv at https://gist.githubusercontent.com/ryanorsinger/c303a90050d3192773288f7eea97b708/raw/536533b90bb2bf41cea27a2c96a63347cde082a6/lemonade.csv
2. Make a baseline for predicting sales. (The mean is a good baseline)
3. Create a new dataframe to hold residuals.
4. Calculate the baseline residuals.
5. Use ols from statsmodels to create a simple linear regression (1 independent variable, 1 dependent variable) to predict sales using flyers.


1. Use the .predict method from ols to produce all of our predictions. Add these predictions to the data.
2. Calculate that model's residuals.
3. Evaluate that model's performance and answer if the model is significant.
4. Evaluate that model's performance and answer if the feature is significant.

In [39]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pydataset import data
from statsmodels.formula.api import ols
from sklearn.metrics import mean_squared_error
from math import sqrt

**1. Create a dataframe from the csv**

In [16]:
df = pd.read_csv('https://gist.githubusercontent.com/ryanorsinger/c303a90050d3192773288f7eea97b708/raw/536533b90bb2bf41cea27a2c96a63347cde082a6/lemonade.csv')
df.head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
1,1/2/17,Monday,28.9,1.33,15,0.5,13
2,1/3/17,Tuesday,34.5,1.33,27,0.5,15
3,1/4/17,Wednesday,44.1,1.05,28,0.5,17
4,1/5/17,Thursday,42.4,1.0,33,0.5,18


**2. Make a baseline for predicting sales. (The mean is a good baseline)**

In [17]:
baseline = df['Sales'].mean()
print('Baseline:', baseline)

Baseline: 25.323287671232876


**3. Create a new dataframe to hold residuals.**

In [21]:
# create an evaluate dataframe
evaluate = pd.DataFrame()

evaluate["x"] = df.Flyers
evaluate["y"] = df.Sales
evaluate["baseline"] = baseline

evaluate.head()

Unnamed: 0,x,y,baseline
0,15,10,25.323288
1,15,13,25.323288
2,27,15,25.323288
3,28,17,25.323288
4,33,18,25.323288


**4. Calculate the baseline residuals.**

In [28]:
evaluate['baseline_residuals'] = evaluate.baseline - evaluate.y

evaluate.head()

Unnamed: 0,x,y,baseline,baseline_residuals
0,15,10,25.323288,15.323288
1,15,13,25.323288,12.323288
2,27,15,25.323288,10.323288
3,28,17,25.323288,8.323288
4,33,18,25.323288,7.323288


**5. Use ols from statsmodels to create a simple linear regression (1 independent variable, 1 dependent variable) to predict sales using flyers.**

In [33]:
# ols("y ~ x") means ols("target ~ feature")
# here the target is Sales and the feature is Flyers
# the df variable is the lemonade data

# fit the model (predictions come in the next step)

model = ols('Sales ~ Flyers', data=df).fit()
model

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x106e2db50>

**1. Use the .predict method from ols to produce all of our predictions. Add these predictions to the data.**

In [None]:
predictions = model.predict()
evaluate["yhat"] = predictions

evaluate.head()

**2. Calculate that model's residuals.**

In [36]:
# calculate model residuals

evaluate["model_residuals"] = evaluate.yhat - evaluate.y
evaluate.head()

Unnamed: 0,x,y,baseline,baseline_residuals,yhat,model_residuals
0,15,10,25.323288,15.323288,14.673754,4.673754
1,15,13,25.323288,12.323288,14.673754,1.673754
2,27,15,25.323288,10.323288,19.727926,4.727926
3,28,17,25.323288,8.323288,20.149107,3.149107
4,33,18,25.323288,7.323288,22.255013,4.255013


**3. Evaluate that model's performance and answer if the model is significant.**

In [38]:
# Calculate if the model beats the baseline

baseline_sse = (evaluate.baseline_residuals**2).sum()
model_sse = (evaluate.model_residuals**2).sum()

if model_sse < baseline_sse:
    print("Our model beats the baseline.")
    print("It makes sense to evaluate this model more deeply.")
else:
    print("Our baseline is better than the model.")

print("Baseline SSE", baseline_sse)
print("Model SSE", model_sse)

Our model beats the baseline.
It makes sense to evaluate this model more deeply.
Baseline SSE 17297.85205479452
Model SSE 6083.326244705024


In [42]:
# Sum the squares of the model errors
model_sse = (evaluate.model_residuals**2).sum()

# The MSE is the average of the squared errors
# mse = model_sse / len(evaluate)

# Or we could calculate this using sklearn's mean_squared_error function
mse = mean_squared_error(evaluate.y, evaluate.yhat)

# Now we'll take the square root of the mean squared error
# Taking the square root is nice because the units of the error
# will be in the same units as the target variable.
rmse = sqrt(mse)

print("Model SSE is", model_sse, " which is the sum of squared errors")
print("Model MSE is", mse, " which is the average squared error")
print("Model RMSE is", rmse, " which is the square root of the MSE")

Model SSE is 6083.326244705024  which is the sum of squared errors
Model MSE is 16.666647245767187  which is the average squared error
Model RMSE is 4.082480526073233  which is the square root of the MSE


**4. Evaluate that model's performance and answer if the feature is significant.**
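
The notebook leaves this step unanswered, so here is a minimal sketch of one way to approach it. It assumes a significance level of 0.05 (`alpha` is our choice, not something the exercise specifies) and, to stay self-contained, refits the model on only the five rows printed by `df.head()` above. On the real data you would read the same attributes from the `model` already fitted on the full dataframe. The `pvalues`, `f_pvalue`, and `rsquared` attributes come from the fitted statsmodels results object.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Toy sample: the five rows printed by df.head() above.
# On the real data, use the full lemonade dataframe instead.
sample = pd.DataFrame({
    "Flyers": [15, 15, 27, 28, 33],
    "Sales": [10, 13, 15, 17, 18],
})

model = ols("Sales ~ Flyers", data=sample).fit()

alpha = 0.05  # assumed significance level

# p-value of the t-test on the Flyers coefficient (is the feature significant?)
feature_p = model.pvalues["Flyers"]

# p-value of the overall F-test (is the model better than an intercept-only model?)
model_p = model.f_pvalue

print("R-squared:", model.rsquared)
print("Flyers coefficient p-value:", feature_p)
print("Model F-test p-value:", model_p)
print("Feature significant at alpha:", feature_p < alpha)
print("Model significant at alpha:", model_p < alpha)
```

With a single feature, the F-test for the model and the t-test for the coefficient give the same p-value, so steps 3 and 4 lead to the same conclusion here. They can differ once a model has more than one feature.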

### Repetition Improves Performance!
 * In the next section of your notebook, perform the steps above with the Rainfall column as the model's feature. Does this model beat the baseline? Would you prefer the rainfall model over the flyers model?
 * In the next section of your notebook, perform the steps above with the log_rainfall column as the model's feature. Does this model beat the baseline? Would you prefer the log_rainfall model over the flyers model? Would you prefer the log_rainfall model over the rainfall model from before?
 * In the next section of your notebook, perform the steps above with the Temperature column as the model's only feature. Does this model beat the baseline? Would you prefer it over the rainfall, log_rainfall, or flyers model?
 * Which of these 4 single-feature regression models would you want to move forward with?
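
Since the same steps repeat for every feature, one option is to wrap them in a function. The sketch below is only one possible layout: `evaluate_feature` is a hypothetical helper name, and the toy dataframe again holds just the five rows shown by `df.head()` so the block runs on its own. Pass the full lemonade dataframe (with `log_rainfall` added) to get real answers.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols


def evaluate_feature(df, feature, target="Sales"):
    """Hypothetical helper: fit target ~ feature and compare it to the mean baseline."""
    model = ols(f"{target} ~ {feature}", data=df).fit()
    y = df[target]
    yhat = model.predict()  # in-sample predictions, as in the steps above

    baseline_sse = ((y.mean() - y) ** 2).sum()
    model_sse = ((yhat - y) ** 2).sum()

    return {
        "feature": feature,
        "baseline_sse": baseline_sse,
        "model_sse": model_sse,
        "rmse": np.sqrt(model_sse / len(df)),
        "r2": model.rsquared,
        "p_value": model.pvalues[feature],
        "beats_baseline": model_sse < baseline_sse,
    }


# Toy sample: the five rows printed by df.head() above.
sample = pd.DataFrame({
    "Temperature": [27.0, 28.9, 34.5, 44.1, 42.4],
    "Rainfall": [2.0, 1.33, 1.33, 1.05, 1.0],
    "Flyers": [15, 15, 27, 28, 33],
    "Sales": [10, 13, 15, 17, 18],
})
sample["log_rainfall"] = np.log(sample.Rainfall)

features = ["Flyers", "Rainfall", "log_rainfall", "Temperature"]
results = pd.DataFrame([evaluate_feature(sample, f) for f in features])

# Lowest RMSE first
results = results.sort_values("rmse")
print(results)
```

Note that an OLS model with an intercept can never have a larger in-sample SSE than the mean baseline, so "beats the baseline" alone will not separate these models. Comparing RMSE, R-squared, and the p-values across the four rows is more informative.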

In [43]:
df.head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales
0,1/1/17,Sunday,27.0,2.0,15,0.5,10
1,1/2/17,Monday,28.9,1.33,15,0.5,13
2,1/3/17,Tuesday,34.5,1.33,27,0.5,15
3,1/4/17,Wednesday,44.1,1.05,28,0.5,17
4,1/5/17,Thursday,42.4,1.0,33,0.5,18


In [47]:
df["log_rainfall"] = np.log(df.Rainfall)
df.head()

Unnamed: 0,Date,Day,Temperature,Rainfall,Flyers,Price,Sales,log_rainfall
0,1/1/17,Sunday,27.0,2.0,15,0.5,10,0.693147
1,1/2/17,Monday,28.9,1.33,15,0.5,13,0.285179
2,1/3/17,Tuesday,34.5,1.33,27,0.5,15,0.285179
3,1/4/17,Wednesday,44.1,1.05,28,0.5,17,0.04879
4,1/5/17,Thursday,42.4,1.0,33,0.5,18,0.0


In [None]:
evaluate3 = pd.DataFrame()

### Tips dataset

1. Load the tips dataset from pydataset or seaborn
2. Define your baseline for "tip". Our goal is to see whether we can build a model that predicts tip from total_bill better than the baseline does.
3. Fit a linear regression model (ordinary least squares) and compute yhat, predictions of tip using total_bill. Here is some sample code to get you started:

In [3]:
#from statsmodels.formula.api import ols
#from pydataset import data

#df = data("tips")

#model = ols('tip ~ total_bill', data=df).fit()
#predictions = model.predict(df.total_bill)

1. Calculate the sum of squared errors, explained sum of squares, total sum of squares, mean squared error, and root mean squared error for your model.
2. Calculate the sum of squared errors, mean squared error, and root mean squared error for the baseline model (i.e. a model that always predicts the average tip amount).
3. Write Python code that compares the sum of squared errors for your model against the sum of squared errors for the baseline model and outputs whether or not your model performs better than the baseline model.
4. What is the amount of variance explained by your model?
5. Is your model significantly better than the baseline model?
6. Plot the residuals for the linear regression model that you made.
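
As a starting point for the error metrics in this list, here is a rough sketch of how SSE, ESS, TSS, MSE, RMSE, and the explained variance relate to one another. The numbers are illustrative values, not the real tips data, and `np.polyfit` stands in for `ols('tip ~ total_bill')` only to keep the block dependency-light. With the real model you would use its predictions for `yhat`.

```python
import numpy as np

# Illustrative values only (NOT the real tips data)
total_bill = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 40.0])
tip = np.array([1.5, 2.0, 3.5, 3.0, 4.5, 6.0])

# Least-squares line via np.polyfit, standing in for ols('tip ~ total_bill')
slope, intercept = np.polyfit(total_bill, tip, deg=1)
yhat = intercept + slope * total_bill

sse = ((tip - yhat) ** 2).sum()         # sum of squared errors (unexplained)
ess = ((yhat - tip.mean()) ** 2).sum()  # explained sum of squares
tss = ((tip - tip.mean()) ** 2).sum()   # total sum of squares

mse = sse / len(tip)
rmse = np.sqrt(mse)

# Proportion of variance explained by the model
r2 = ess / tss

print("SSE:", sse, "ESS:", ess, "TSS:", tss)
print("MSE:", mse, "RMSE:", rmse)
print("R-squared:", r2)
```

Because the baseline always predicts the mean tip, the baseline's SSE is the same quantity as TSS, so comparing the model SSE to the baseline SSE and computing R-squared are two views of the same comparison.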