### Evaluating simple linear regressions on lemonade data with other features

#### Create a dataframe from the csv at https://gist.githubusercontent.com/ryanorsinger/c303a90050d3192773288f7eea97b708/raw/536533b90bb2bf41cea27a2c96a63347cde082a6/lemonade.csv

In [73]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Linear Model
from statsmodels.formula.api import ols

In [74]:
# Let's work with some sales data!
df = pd.read_csv("https://gist.githubusercontent.com/ryanorsinger/9867c96ddb56626e9aac94d8e92dabdf/raw/45f9a36a8871ac0e24317704ed0072c9dded1327/lemonade_regression.csv")

df.head()

Unnamed: 0,temperature,rainfall,flyers,sales
0,27.0,2.0,15,10
1,28.9,1.33,15,13
2,34.5,1.33,27,15
3,44.1,1.05,28,17
4,42.4,1.0,33,18


#### Make a baseline for predicting sales. (The mean is a good baseline.)

In [75]:
baseline = df.sales.mean()

baseline

25.323287671232876
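One reason the mean makes a reasonable baseline: among all constant predictions, the mean gives the smallest sum of squared errors. A minimal sketch of that idea, using only the five sales values shown in df.head() above (not the full dataset) and a hypothetical helper named constant_sse:

```python
import numpy as np

# The five sales values shown in df.head() above (not the full dataset)
sales = np.array([10, 13, 15, 17, 18], dtype=float)

def constant_sse(actual, constant):
    """Sum of squared errors when every prediction is the same constant."""
    return float(((actual - constant) ** 2).sum())

mean_sse = constant_sse(sales, sales.mean())

# Try a grid of other constant guesses; none should beat the mean
candidates = np.linspace(sales.min(), sales.max(), 81)
best_candidate_sse = min(constant_sse(sales, c) for c in candidates)

print("SSE using the mean:", mean_sse)
print("Best SSE over the grid:", best_candidate_sse)
```

Any model worth keeping should have a lower SSE than this constant guess.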

#### Create a new dataframe to hold residuals.

In [76]:
residuals = pd.DataFrame()

#### Calculate the baseline residuals.

In [77]:
residuals['x'] = df.flyers
residuals["y"] = df.sales
residuals["baseline"] = baseline
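# Computed here as prediction minus actual; the usual convention is actual minus
# prediction. The sign flips, but squared errors (and SSE) are unchanged.
residuals["baseline_residual"] = residuals.baseline - residuals.y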
residuals.head()

Unnamed: 0,x,y,baseline,baseline_residual
0,15,10,25.323288,15.323288
1,15,13,25.323288,12.323288
2,27,15,25.323288,10.323288
3,28,17,25.323288,8.323288
4,33,18,25.323288,7.323288


#### Use ols from statsmodels to create a simple linear regression (1 independent variable, 1 dependent variable) to predict sales using flyers. 

In [78]:
model = ols('sales ~ flyers', data=df).fit()

#### Use the fitted model's .predict method to produce a prediction for every row, then add these predictions to the residuals dataframe.

In [79]:
residuals["yhat"] = model.predict()

residuals.head()

Unnamed: 0,x,y,baseline,baseline_residual,yhat
0,15,10,25.323288,15.323288,14.673754
1,15,13,25.323288,12.323288,14.673754
2,27,15,25.323288,10.323288,19.727926
3,28,17,25.323288,8.323288,20.149107
4,33,18,25.323288,7.323288,22.255013


#### Calculate that model's residuals.

In [80]:
# Same sign convention as the baseline residuals: prediction minus actual
residuals["yhat_residuals"] = residuals.yhat - residuals.y

residuals.head()

Unnamed: 0,x,y,baseline,baseline_residual,yhat,yhat_residuals
0,15,10,25.323288,15.323288,14.673754,4.673754
1,15,13,25.323288,12.323288,14.673754,1.673754
2,27,15,25.323288,10.323288,19.727926,4.727926
3,28,17,25.323288,8.323288,20.149107,3.149107
4,33,18,25.323288,7.323288,22.255013,4.255013


#### Evaluate that model's performance and determine whether the model is significant.

In [83]:
baseline_sse = (residuals.baseline_residual**2).sum()
flyer_model_sse = (residuals.yhat_residuals**2).sum()

if flyer_model_sse < baseline_sse:
    print("Our model beats the baseline")
else:
    print("Our baseline is better than the model.")

print("\nBaseline SSE", baseline_sse)
print("\nModel SSE", flyer_model_sse)

Our model beats the baseline

Baseline SSE 17297.85205479452

Model SSE 6083.326244705024
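SSE grows with the number of rows, so it is often reported alongside MSE (SSE divided by the row count) and RMSE (the square root of MSE, which is in the same units as sales). A minimal sketch, assuming a hypothetical helper named regression_errors and using only the first five rows of y and yhat shown earlier:

```python
import numpy as np

def regression_errors(actual, predicted):
    """Return SSE, MSE, and RMSE for a set of predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    squared_errors = (actual - predicted) ** 2
    sse = float(squared_errors.sum())
    mse = sse / len(actual)   # average squared error per row
    rmse = mse ** 0.5         # back in the units of the target
    return {"sse": sse, "mse": mse, "rmse": rmse}

# First five rows of y and yhat from the residuals table above
y = [10, 13, 15, 17, 18]
yhat = [14.673754, 14.673754, 19.727926, 20.149107, 22.255013]
print(regression_errors(y, yhat))
```

Because all three metrics are computed from the same squared errors, they rank models fit on the same rows in the same order.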


In [49]:
r2 = model.rsquared
print('R-squared = ', round(r2,3))

R-squared =  0.648


In [50]:
f_pval = model.f_pvalue
print("p-value for model significance = ", f_pval)

p-value for model significance =  2.193718738113383e-84


In [51]:
print('Since the p-value is less than alpha (.05), we reject the null hypothesis. Our model is significant.')

Since the p-value is less than alpha (.05), we reject the null hypothesis. Our model is significant.


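#### Evaluate that model's performance and determine whether the feature is significant.

In [52]:
print('The feature is significant: in a simple linear regression, the F-test for the model is equivalent to the t-test on its single slope, so a significant model means a significant feature.')

The feature is significant: in a simple linear regression, the F-test for the model is equivalent to the t-test on its single slope, so a significant model means a significant feature.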
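With a single feature, the model-level F statistic equals the square of the slope's t statistic, which is why the two significance tests agree. The sketch below checks that identity by hand with NumPy on the first five rows of flyers and sales shown earlier; the helper name slope_t_and_model_f is an assumption made for illustration. On the fitted statsmodels result, the per-coefficient p-values are available as model.pvalues.

```python
import numpy as np

def slope_t_and_model_f(x, y):
    """Return (t statistic for the slope, F statistic for the model)
    for a one-feature least-squares fit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    # Least-squares slope and intercept
    sxx = ((x - x.mean()) ** 2).sum()
    slope = ((x - x.mean()) * (y - y.mean())).sum() / sxx
    intercept = y.mean() - slope * x.mean()

    yhat = intercept + slope * x
    sse = ((y - yhat) ** 2).sum()      # unexplained variation
    sst = ((y - y.mean()) ** 2).sum()  # total variation (the baseline SSE)

    # t statistic: slope divided by its standard error
    slope_se = np.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / slope_se

    # F statistic: explained variation (1 feature) over residual variance
    f_stat = (sst - sse) / (sse / (n - 2))
    return float(t_stat), float(f_stat)

# First five rows of flyers and sales from df.head() above
t_stat, f_stat = slope_t_and_model_f([15, 15, 27, 28, 33], [10, 13, 15, 17, 18])
print("t squared:", t_stat ** 2)
print("F:", f_stat)
```

The identity only holds for one-feature models; with several features, each coefficient needs its own test.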


### Repetition Improves Performance!

#### In the next section of your notebook, perform the steps above with the rainfall column as the model's feature. 

In [86]:
df = pd.read_csv("https://gist.githubusercontent.com/ryanorsinger/9867c96ddb56626e9aac94d8e92dabdf/raw/45f9a36a8871ac0e24317704ed0072c9dded1327/lemonade_regression.csv")

baseline = df.sales.mean()

residuals = pd.DataFrame()

residuals['x'] = df.rainfall
residuals["y"] = df.sales

residuals["baseline"] = baseline
residuals["baseline_residual"] = residuals.baseline - residuals.y

model = ols('sales ~ rainfall', data=df).fit()

residuals["yhat"] = model.predict()

residuals["yhat_residuals"] = residuals.yhat - residuals.y

residuals.head()

Unnamed: 0,x,y,baseline,baseline_residual,yhat,yhat_residuals
0,2.0,10,25.323288,15.323288,-1.599602,-11.599602
1,1.33,13,25.323288,12.323288,13.773142,0.773142
2,1.33,15,25.323288,10.323288,13.773142,-1.226858
3,1.05,17,25.323288,8.323288,20.197573,3.197573
4,1.0,18,25.323288,7.323288,21.344793,3.344793


In [87]:
r2 = model.rsquared

f_pval = model.f_pvalue

print("p-value for model significance = ", f_pval,'\n')

print('Since the p-value is less than alpha (.05), we reject the null hypothesis. Our model is significant.')

p-value for model significance =  3.2988846597381e-140 

Since the p-value is less than alpha (.05), we reject the null hypothesis. Our model is significant.


#### Does this model beat the baseline? 

In [88]:
baseline_sse = (residuals.baseline_residual**2).sum()
rainfall_model_sse = (residuals.yhat_residuals**2).sum()

if rainfall_model_sse < baseline_sse:
    print("Our model beats the baseline")
else:
    print("Our baseline is better than the model.")

print("\nBaseline SSE", baseline_sse)
print("\nModel SSE", rainfall_model_sse)

Our model beats the baseline

Baseline SSE 17297.85205479452

Model SSE 2998.2371310300655


#### Would you prefer the rainfall model over the flyers model?

In [94]:
print(f'Rainfall model SSE: {rainfall_model_sse}')
print(f'Flyer model SSE: {flyer_model_sse}')
print('\nSince the rainfall model has a lower SSE, I would prefer it over the flyer model.')

Rainfall model SSE: 2998.2371310300655
Flyer model SSE: 6083.326244705024

Since the rainfall model has a lower SSE, I would prefer it over the flyer model.


#### In the next section of your notebook, perform the steps above with the log_rainfall column as the model's feature. 

In [95]:
df = pd.read_csv("https://gist.githubusercontent.com/ryanorsinger/9867c96ddb56626e9aac94d8e92dabdf/raw/45f9a36a8871ac0e24317704ed0072c9dded1327/lemonade_regression.csv")

df["log_rainfall"] = np.log(df.rainfall)

baseline = df.sales.mean()

residuals = pd.DataFrame()

residuals['x'] = df.log_rainfall
residuals["y"] = df.sales

residuals["baseline"] = baseline
residuals["baseline_residual"] = residuals.baseline - residuals.y

model = ols('sales ~ log_rainfall', data=df).fit()

residuals["yhat"] = model.predict()

residuals["yhat_residuals"] = residuals.yhat - residuals.y

residuals.head()

Unnamed: 0,x,y,baseline,baseline_residual,yhat,yhat_residuals
0,0.693147,10,25.323288,15.323288,3.688573,-6.311427
1,0.285179,13,25.323288,12.323288,13.198359,0.198359
2,0.285179,15,25.323288,10.323288,13.198359,-1.801641
3,0.04879,17,25.323288,8.323288,18.708608,1.708608
4,0.0,18,25.323288,7.323288,19.845912,1.845912
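np.log is only defined for positive inputs: a zero becomes -inf and a negative value becomes NaN, either of which would break the regression. The rows shown above all have rainfall of at least 1.0, but the full column is not shown here, so a guard is one option. A minimal sketch, assuming a hypothetical helper named safe_log:

```python
import numpy as np

def safe_log(values):
    """Natural log that refuses zero or negative inputs instead of
    silently returning -inf or NaN."""
    values = np.asarray(values, dtype=float)
    if (values <= 0).any():
        raise ValueError("log is undefined for zero or negative values")
    return np.log(values)

# Rainfall values from df.head() above
print(safe_log([2.0, 1.33, 1.33, 1.05, 1.0]))
```

In the notebook this could stand in for the direct np.log(df.rainfall) call if the data might contain dry days recorded as 0.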


In [96]:
r2 = model.rsquared

f_pval = model.f_pvalue

print("p-value for model significance = ", f_pval,'\n')

print('Since the p-value is less than alpha (.05), we reject the null hypothesis. Our model is significant.')

p-value for model significance =  1.2242624097795882e-230 

Since the p-value is less than alpha (.05), we reject the null hypothesis. Our model is significant.


#### Does this model beat the baseline? 

In [97]:
baseline_sse = (residuals.baseline_residual**2).sum()
rainfall_log_model_sse = (residuals.yhat_residuals**2).sum()

if rainfall_log_model_sse < baseline_sse:
    print("Our model beats the baseline")
else:
    print("Our baseline is better than the model.")

print("\nBaseline SSE", baseline_sse)
print("\nModel SSE", rainfall_log_model_sse)

Our model beats the baseline

Baseline SSE 17297.85205479452

Model SSE 952.3253474293448


#### Would you prefer the log_rainfall model over the flyers model? 

In [103]:
print(f'Rainfall log model SSE: {rainfall_log_model_sse}')
print(f'Flyer model SSE: {flyer_model_sse}')
print('\nSince the rainfall log model has a lower SSE, I would prefer it over the flyer model.')

Rainfall log model SSE: 952.3253474293448
Flyer model SSE: 6083.326244705024

Since the rainfall log model has a lower SSE, I would prefer it over the flyer model.


#### Would you prefer the model built with log_rainfall over the rainfall model from before?

In [104]:
print(f'Rainfall log model SSE: {rainfall_log_model_sse}')
print(f'Rainfall model SSE: {rainfall_model_sse}')
print('\nSince the rainfall log model has a lower SSE, I would prefer it over the rainfall model.')

Rainfall log model SSE: 952.3253474293448
Rainfall model SSE: 2998.2371310300655

Since the rainfall log model has a lower SSE, I would prefer it over the rainfall model.
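Comparing raw SSE values works here because all three models predict the same target on the same rows. Dividing each model's SSE by the baseline SSE puts them on a common 0-to-1 scale, since R-squared equals 1 minus that ratio. A short sketch using only the SSE values printed above:

```python
# SSE values copied from the printed outputs above
baseline_sse = 17297.85205479452
model_sse = {
    "flyers": 6083.326244705024,
    "rainfall": 2998.2371310300655,
    "log_rainfall": 952.3253474293448,
}

# R-squared = 1 - (model SSE / baseline SSE)
r_squared = {name: 1 - sse / baseline_sse for name, sse in model_sse.items()}

# Rank from best (lowest SSE) to worst
ranking = sorted(model_sse, key=model_sse.get)
for name in ranking:
    print(f"{name:>12}  SSE={model_sse[name]:10.2f}  R^2={r_squared[name]:.3f}")
```

The flyers value reproduces the 0.648 that model.rsquared reported earlier, which is a useful sanity check on the SSE arithmetic.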


#### In the next section of your notebook, perform the steps above with the temperature column as the model's only feature. Does this model beat the baseline? Would you prefer the rainfall, log_rainfall, or the flyers model?

#### Which of these 4 single regression models would you want to move forward with?
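The temperature step is left open above, and no results for it appear here. One way to run it and rank all four models is to wrap the repeated steps in a function. The sketch below assumes a hypothetical helper named evaluate_feature and demonstrates it on made-up stand-in data so that it runs on its own; the numbers it prints say nothing about the lemonade dataset.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

def evaluate_feature(df, feature, target="sales"):
    """Fit target ~ feature with ols and compare it against the mean baseline."""
    baseline_sse = ((df[target] - df[target].mean()) ** 2).sum()
    model = ols(f"{target} ~ {feature}", data=df).fit()
    model_sse = (model.resid ** 2).sum()
    return {
        "feature": feature,
        "baseline_sse": float(baseline_sse),
        "model_sse": float(model_sse),
        "beats_baseline": bool(model_sse < baseline_sse),
        "r2": float(model.rsquared),
        "f_pvalue": float(model.f_pvalue),
    }

# Made-up stand-in data so the sketch runs on its own; NOT the lemonade data
rng = np.random.default_rng(0)
temperature = rng.uniform(20, 100, size=50)
demo = pd.DataFrame({
    "temperature": temperature,
    "sales": 0.4 * temperature + rng.normal(0, 2, size=50),
})
result = evaluate_feature(demo, "temperature")
print(result)

# With the notebook's df (after adding the log_rainfall column), a ranking
# of all four models could look like this:
# features = ["flyers", "rainfall", "log_rainfall", "temperature"]
# pd.DataFrame([evaluate_feature(df, f) for f in features]).sort_values("model_sse")
```

Sorting by model_sse puts the strongest single-feature candidate in the first row, which answers the final question directly.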