## Evaluation Exercises

Do your work for this exercise in a Jupyter notebook named evaluate within your regression-exercises repository. By the end of this exercise you will also create a Python script named evaluate, so the deliverables are the Python script and the Jupyter notebook, both named evaluate, within your regression-exercises repo.

In [1]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
from pydataset import data

1. Load the tips dataset from either pydataset or seaborn.

In [2]:
df = data('tips')

In [3]:
# what does it look like?
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB


In [None]:
# create a target variable
# target = 'tip'
# df = df.drop(columns=['sex', 'smoker', 'day', 'time', 'size'])
# df.info()

In [9]:
# create a target variable
target = 'tip'
# create a baseline model
def baseline_regression(df, target):
    '''
    This function will take in a dataframe, and predefined target variable (as a string)
    and create a 'yhat_baseline' column that is the mean of the target variable.
    This will serve as the baseline model to compare against.
    '''
    df['yhat_baseline'] = df[target].mean()
    return df.head(1)

In [10]:
baseline_regression(df, target)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,yhat_baseline,yhat
1,16.99,1.01,Female,No,Sun,Dinner,2,2.998279,2.704636


2. Fit a linear regression model (ordinary least squares) and compute yhat, predictions of tip using total_bill.
- Here is some sample code to get you started:

In [None]:
# from sklearn.linear_model import LinearRegression

# assuming X and y are already defined
# model = LinearRegression().fit(X, y)
# predictions = model.predict(X)

In [11]:
# define variable to build model with
variables = ['total_bill']
# create and fit our LinearRegression model
def linear_model(df, variables, target):
    '''
    This function will take in a dataframe, a predefined variables list
    to build the LinearRegression Model on, and a predefined target 
    variable (as a string). It creates a LinearRegression Model on the list of variables,
    and creates a prediction column labeled 'yhat', and finally returns
    the head of the dataframe.
    '''
    # use the arguments rather than hard-coded column names
    model = LinearRegression().fit(df[variables], df[target])
    df['yhat'] = model.predict(df[variables])
    return df.head(1)

In [12]:
linear_model(df, variables, target)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,yhat_baseline,yhat
1,16.99,1.01,Female,No,Sun,Dinner,2,2.998279,2.704636


3. Plot the residuals for the linear regression model that you made.

In [None]:
# create the residuals column by subtracting predicted tip from actual tip
df['residuals'] = df['tip'] - df['yhat']

In [None]:
df['residuals'].plot.hist()

In [None]:
# what would the baseline residuals look like?
df['baseline_residuals'] = df['tip'] - df['yhat_baseline']
df['baseline_residuals'].plot.hist()

4. Calculate the sum of squared errors, explained sum of squares, total sum of squares, mean squared error, and root mean squared error for your model.

In [None]:
# set an n value for number of observations
n = df.shape[0]

In [None]:
# sum of squared errors
sse = (df['residuals'] ** 2).sum()
sse

In [None]:
# explained sum of squares
ess = ((df['yhat'] - df['tip'].mean())**2).sum()
ess

In [None]:
# total sum of squares
tss = ((df['tip'] - df['tip'].mean())**2).sum()
tss

In [None]:
# mean squared error
mse = sse / n
mse

In [None]:
# root mean squared error
rmse = math.sqrt(mse)
rmse

In [None]:
# r^2 score

r_sq = ess / tss
r_sq
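
For reference, the cells above correspond to the standard definitions below, where $y_i$ is the actual tip, $\hat{y}_i$ the predicted tip, $\bar{y}$ the mean tip, and $n$ the number of observations. The decomposition $TSS = ESS + SSE$, and therefore the equality of the two forms of $R^2$, holds for an ordinary least squares fit with an intercept evaluated on the data it was fit to, which is the situation here.

$$
\begin{aligned}
SSE &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \\
ESS &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \\
TSS &= \sum_{i=1}^{n} (y_i - \bar{y})^2 = ESS + SSE \\
MSE &= \frac{SSE}{n} \\
RMSE &= \sqrt{MSE} \\
R^2 &= \frac{ESS}{TSS} = 1 - \frac{SSE}{TSS}
\end{aligned}
$$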

5. Calculate the sum of squared errors, mean squared error, and root mean squared error for the baseline model (i.e. a model that always predicts the average tip amount).

In [None]:
# baseline sse
sse_baseline = (df['baseline_residuals'] ** 2).sum()
mse_baseline = sse_baseline / n
rmse_baseline = math.sqrt(mse_baseline)

print(f'''
Baseline:

sse:    {sse_baseline}
mse:    {mse_baseline}
rmse:   {rmse_baseline}
''')

In [None]:
# so what does this all look like?

fig, ax = plt.subplots(figsize=(13,7))

ax.hist(df['baseline_residuals'], label='baseline residuals', alpha=.6)
ax.hist(df['residuals'], label='model residuals', alpha=.6)
ax.legend()

In [None]:
# let's look at actual vs predicted tips

fig, ax = plt.subplots(figsize=(13,7))
ax.scatter(df['tip'], df['yhat'], label='predicted')
ax.set(title='Actual vs Predicted Tip Value', ylabel='Predicted Tip', xlabel='Actual Tip')
ax.plot(df['tip'], df['tip'], ls=':', c='gray')

In [None]:
# now let's look at residuals...
fig, ax = plt.subplots(figsize=(13,7))

ax.scatter(df['tip'], df['residuals'])
ax.set(ylabel='residual', xlabel='actual')
ax.hlines(0, *ax.get_xlim(), ls=':', color='gray')

6. Write python code that compares the sum of squared errors for your model against the sum of squared errors for the baseline model and outputs whether or not your model performs better than the baseline model.

In [None]:
target = 'tip'
variables = ['total_bill']

In [None]:
# df['residuals'] = df[target] - df['yhat']
# df.head()

In [None]:
def sse_comp(df, variables, target):
    '''
    This function will take in a pandas DataFrame, and predefined list of variables to
    apply the LinearRegression Model to, as well as a predefined target variable.
    This function creates and fits a LinearRegression model, creates a 'yhat' or prediction column,
    calculates model residuals. It also creates a baseline model, and calculates baseline
    residuals. The Sum of Squared Errors are calculated for the baseline, and the model.
    Both are printed out, as well as the improvement in SSE over baseline
    '''
    # create and fit our LinearRegression model
    model = LinearRegression().fit(df[variables], df[target])
    # create a yhat/prediction column
    df['yhat'] = model.predict(df[variables])
    # create the residuals column by subtracting predicted tip from actual tip
    df['residuals'] = df[target] - df['yhat']
    # sum of squared errors for model
    sse = (df['residuals'] ** 2).sum()
    # create a baseline model
    df['yhat_baseline'] = df[target].mean()
    # baseline residuals
    df['baseline_residuals'] = df[target] - df['yhat_baseline']
    # baseline sse
    sse_baseline = (df['baseline_residuals'] ** 2).sum()
    print(f'Baseline Model SSE:          {sse_baseline:.2f}')
    print(f'Linear Regression Model SSE: {sse:.2f}')
    print(f'Improvement over baseline:   {sse_baseline-sse:.2f}')
    

In [None]:
sse_comp(df, variables, target)

7. What is the amount of variance explained in your model?

In [None]:
model_r2 = r2_score(df['tip'], df['yhat'])
# the built-in round works whether r2_score returns a Python float or a NumPy float
round(model_r2, 4)

8. Is your model better than the baseline model?

In [None]:
baseline_r2 = r2_score(df['tip'], df['yhat_baseline'])

print(f'Linear Regression r2_score: {model_r2:.4f}')
print(f'Baseline r2_score:          {baseline_r2:.4f}')
print(f'Improvement over baseline:  {model_r2-baseline_r2:.4f}')

9. Create a file named evaluate.py that contains the following functions.

- plot_residuals(y, yhat): creates a residual plot

In [None]:
def plot_residuals(df, variables, target):
    '''
    This function will take in a pandas DataFrame, and predefined list of variables to
    apply the LinearRegression Model to, as well as a predefined target variable.
    This function creates and fits a LinearRegression model, creates a 'yhat' or prediction column,
    calculates model residuals. It also creates a baseline model, and calculates baseline
    residuals. These are then plotted in a histogram.
    '''
    # create and fit our LinearRegression model
    model = LinearRegression().fit(df[variables], df[target])
    # create a yhat/prediction column
    df['yhat'] = model.predict(df[variables])
    # create the residuals column by subtracting predicted tip from actual tip
    df['residuals'] = df[target] - df['yhat']
    # create a baseline model
    df['yhat_baseline'] = df[target].mean()
    # baseline residuals
    df['baseline_residuals'] = df[target] - df['yhat_baseline']
    fig, ax = plt.subplots(figsize=(13,7))
    ax.hist(df['baseline_residuals'], label='baseline residuals', alpha=.6)
    ax.hist(df['residuals'], label='model residuals', alpha=.6)
    ax.legend()
    return plt.show()

In [None]:
plot_residuals(df, variables, target)
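
The exercise asks for the signature `plot_residuals(y, yhat)`, while the function above takes a DataFrame and refits the model. A minimal sketch of the requested signature follows, assuming `y` and `yhat` are equal-length array-likes. Returning the Axes object instead of calling `plt.show()` is a choice made for this sketch, not something the exercise requires.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_residuals(y, yhat):
    """Scatter the residuals (actual - predicted) against the actual values."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    residuals = y - yhat

    fig, ax = plt.subplots(figsize=(13, 7))
    ax.scatter(y, residuals)
    # dotted reference line at zero: points on it were predicted exactly
    ax.axhline(0, ls=':', color='gray')
    ax.set(title='Residual Plot', xlabel='actual', ylabel='residual')
    return ax
```

In the notebook this could be called as `plot_residuals(df['tip'], df['yhat'])` once the `yhat` column exists.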

- regression_errors(y, yhat): returns the following values:
    - sum of squared errors (SSE)
    - explained sum of squares (ESS)
    - total sum of squares (TSS)
    - mean squared error (MSE)
    - root mean squared error (RMSE)

In [None]:
def regression_errors(df, variables, target):
    '''
    This function will take in a pandas DataFrame, and predefined list of variables to
    apply the LinearRegression Model to, as well as a predefined target variable.
    This function creates and fits a LinearRegression model, creates a 'yhat' or prediction column,
    calculates model residuals. It will then calculate and print out the SSE,
    ESS, TSS, MSE, and RMSE of the model.
    '''
    # create and fit our LinearRegression model
    model = LinearRegression().fit(df[variables], df[target])
    # create a yhat/prediction column
    df['yhat'] = model.predict(df[variables])
    # create the residuals column by subtracting predicted tip from actual tip
    df['residuals'] = df[target] - df['yhat']
    # set an n value for number of observations
    n = df.shape[0]
    # sum of squared errors
    sse = (df['residuals'] ** 2).sum()
    # explained sum of squares
    ess = ((df['yhat'] - df[target].mean())**2).sum()
    # total sum of squares
    tss = ((df[target] - df[target].mean())**2).sum()
    # mean squared error
    mse = sse / n
    # root mean squared error
    rmse = mse ** .5
    
    print(f'    Sum of Square Errors (SSE): {sse:.2f}')
    print(f'Explained Sum of Squares (ESS): {ess:.2f}')
    print(f'    Total Sum of Squares (TSS): {tss:.2f}')
    print(f'      Mean Squared Error (MSE): {mse:.2f}')
    print(f'Root Mean Squared Error (RMSE): {rmse:.2f}')
    
    

In [None]:
regression_errors(df, variables, target)
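
The exercise describes `regression_errors(y, yhat)` as returning the metrics, whereas the function above takes a DataFrame and prints them. A minimal sketch matching the requested signature is shown below. Returning a dict keyed by metric name is an assumption of this sketch, since the exercise does not specify a return type.

```python
import numpy as np


def regression_errors(y, yhat):
    """Return SSE, ESS, TSS, MSE, and RMSE for predictions yhat of y."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)

    sse = float(((y - yhat) ** 2).sum())        # sum of squared errors
    ess = float(((yhat - y.mean()) ** 2).sum())  # explained sum of squares
    tss = float(((y - y.mean()) ** 2).sum())     # total sum of squares
    mse = sse / len(y)                           # mean squared error
    rmse = mse ** 0.5                            # root mean squared error
    return {'sse': sse, 'ess': ess, 'tss': tss, 'mse': mse, 'rmse': rmse}
```

In the notebook this could be called as `regression_errors(df['tip'], df['yhat'])`.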

- baseline_mean_errors(y): computes the SSE, MSE, and RMSE for the baseline model

In [None]:
def baseline_mean_errors(df, target):
    '''
    Takes in a pandas DataFrame and a predefined target variable (as a string),
    creates the yhat_baseline and baseline_residuals columns, then calculates
    and prints out SSE, MSE, and RMSE for the baseline.
    '''
    # create a baseline model
    df['yhat_baseline'] = df[target].mean()
    # baseline residuals
    df['baseline_residuals'] = df[target] - df['yhat_baseline']
    # set an n value for number of observations
    n = df.shape[0]
    # baseline sum of squared errors
    sse_baseline = (df['baseline_residuals'] ** 2).sum()
    # baseline mean squared error
    mse_baseline = sse_baseline / n
    # baseline root mean squared error     
    rmse_baseline = mse_baseline ** .5
    
    print(f'   Baseline Sum of Squared Errors (SSE): {sse_baseline:.2f}')
    print(f'      Baseline Mean Squared Error (MSE): {mse_baseline:.2f}')
    print(f'Baseline Root Mean Squared Error (RMSE): {rmse_baseline:.2f}')
    

In [None]:
baseline_mean_errors(df, target)
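
The requested signature is `baseline_mean_errors(y)`, taking only the actual values. A minimal sketch under that signature is shown below. As with the other sketches, the dict return value is an assumption and not part of the exercise text.

```python
import numpy as np


def baseline_mean_errors(y):
    """Return SSE, MSE, and RMSE for a baseline that always predicts mean(y)."""
    y = np.asarray(y, dtype=float)
    residuals = y - y.mean()  # baseline prediction is the mean of y

    sse = float((residuals ** 2).sum())
    mse = sse / len(y)
    rmse = mse ** 0.5
    return {'sse': sse, 'mse': mse, 'rmse': rmse}
```

In the notebook this could be called as `baseline_mean_errors(df['tip'])`.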

- better_than_baseline(y, yhat): returns true if your model performs better than the baseline, otherwise false
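
No implementation of this function appears in the notebook, so the following is only one possible sketch. It assumes "performs better" means a strictly lower sum of squared errors than the mean baseline, in line with the SSE comparison in exercise 6.

```python
import numpy as np


def better_than_baseline(y, yhat):
    """Return True if the model's SSE is strictly lower than the mean baseline's SSE."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)

    sse_model = ((y - yhat) ** 2).sum()
    sse_baseline = ((y - y.mean()) ** 2).sum()
    # bool() converts the NumPy boolean into a plain Python bool
    return bool(sse_model < sse_baseline)
```

A tie counts as not better here, so predicting the mean for every observation returns False.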

10. Load the mpg dataset and fit a model that predicts highway mileage based on engine displacement. Take a look at all the regression evaluation metrics, and determine whether this model is better than the baseline model. Use the functions from your evaluate.py to help accomplish this.
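
One way to approach this exercise is sketched below. The helper name `evaluate_simple_model` is hypothetical, and it recomputes the metrics inline so that the block is self-contained. In the repository you would import the functions from evaluate.py instead. The usage comment assumes the pydataset mpg table names engine displacement `displ` and highway mileage `hwy`.

```python
from sklearn.linear_model import LinearRegression


def evaluate_simple_model(df, feature, target):
    """Fit target ~ feature with OLS and compare it against the mean baseline."""
    X, y = df[[feature]], df[target]
    yhat = LinearRegression().fit(X, y).predict(X)

    n = len(y)
    sse = float(((y - yhat) ** 2).sum())
    sse_baseline = float(((y - y.mean()) ** 2).sum())
    return {
        'sse': sse,
        'mse': sse / n,
        'rmse': (sse / n) ** 0.5,
        'sse_baseline': sse_baseline,
        'rmse_baseline': (sse_baseline / n) ** 0.5,
        # undefined if the target is constant (sse_baseline == 0)
        'r2': 1 - sse / sse_baseline,
        'better_than_baseline': sse < sse_baseline,
    }


# Usage in the notebook (assumes pydataset is installed):
# from pydataset import data
# mpg = data('mpg')
# evaluate_simple_model(mpg, feature='displ', target='hwy')
```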