# Required Maggie Exercises

for practice:

Use any dataset we have previously used that is not grades

1. Set baseline predictions (mean, median).
2. Evaluate the baseline. We are comparing y (the actual values) to the predicted values, which are all the same value (the mean of y, for example):
    - y: 19, 18, 12, 8, 5
    - y_pred: 11, 11, 11, 11, 11
3. Fit each of the following models:
    - LinearRegression()
    - LassoLars()
    - PolynomialFeatures(degree=2) ... then LinearRegression()
4. For each one, evaluate with training predictions, and then with validate predictions.
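
The baseline evaluation described above can be worked through by hand on the five example values. This is a minimal sketch using only the standard library; the `rmse` helper and the `candidates` dictionary are names made up for illustration. It scores three constant predictions: the 11 from the example, the mean of y, and the median of y.

```python
import math
import statistics

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y = [19, 18, 12, 8, 5]

# A baseline predicts one constant for every observation.
candidates = {
    "constant 11 (from the example)": 11,
    "mean of y": statistics.mean(y),      # 12.4
    "median of y": statistics.median(y),  # 12
}

baseline_scores = {
    name: rmse(y, [value] * len(y)) for name, value in candidates.items()
}

for name, score in baseline_scores.items():
    print(f"{name}: RMSE = {score:.3f}")
```

Among constant predictions, the mean always gives the lowest RMSE, so a median baseline will score slightly worse on this metric even when it is the more robust choice.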


In [1]:
import pandas as pd
import numpy as np
import wrangle
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings("ignore")

from pydataset import data
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoLars
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import TweedieRegressor

In [2]:
train, validate, test = wrangle.wrangle_telco()
train.shape, validate.shape, test.shape

((1224, 7), (216, 7), (255, 7))

In [3]:
# we have a scaled dataframe, drop columns not needed in modeling
strain = train.drop(columns=['customer_id', 'monthly_charges', 'tenure', 'total_charges_scaled'])
strain.shape

(1224, 3)

In [4]:
# repeat drop columns for validate and test scaled dataframes
svalidate = validate.drop(columns=['customer_id', 'monthly_charges', 'tenure', 'total_charges_scaled'])
stest = test.drop(columns=['customer_id', 'monthly_charges', 'tenure', 'total_charges_scaled'])
svalidate.shape

(216, 3)

In [5]:
# want to predict 'total_charges' and not 'total_charges_scaled' - not sure if that will work
# split into X and y
X_strain = strain.drop(columns=['total_charges'])
X_svalidate = svalidate.drop(columns=['total_charges'])
X_stest = stest.drop(columns=['total_charges'])

y_strain = strain[['total_charges']]
y_svalidate = svalidate[['total_charges']]
y_stest = stest[['total_charges']]

### Find the baseline

In [6]:
# 1st find the mean and/or median of target variable
np.mean(y_strain)
# np.median(y_strain) = 0.4268

total_charges    3746.538235
dtype: float64

In [7]:
# now find the root mean squared error using the mean as the predicted value for all observations

# np.full creates an array of the specified length and fills every element with the given value
# len(y_strain) # there are 1224 observations in y_strain

baseline_pred = np.full(len(y_strain), y_strain['total_charges'].mean())
baseline_rmse = mean_squared_error(y_strain, baseline_pred)**(1/2)
print('Baseline RMSE:', baseline_rmse)

Baseline RMSE: 2594.1553229464807


### Linear Regression()

In [8]:
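# create the model, fit the model, use the model
# note: the `normalize` argument was deprecated in scikit-learn 1.0 and removed in 1.2; omit it on newer versions
lm = LinearRegression(normalize=True)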
lm.fit(X_strain, y_strain)
lm_pred = lm.predict(X_strain)

In [9]:
# Evaluate: compute root mean squared error
lm_rmse = mean_squared_error(y_strain, lm_pred)**(1/2)
lm_rmse

502.88782232713487

In [10]:
# 2nd best performing model

**Validate**

In [11]:
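# create a second model, fit it on validate, and predict on validate
# note: because lm_v is fit on validate, the RMSE below is an in-sample error for validate;
# to check how the train model generalizes, use lm.predict(X_svalidate) instead
lm_v = LinearRegression(normalize=True)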
lm_v.fit(X_svalidate, y_svalidate)
lm_v_pred = lm_v.predict(X_svalidate)

In [12]:
# Evaluate: compute root mean squared error
lm_v_rmse = mean_squared_error(y_svalidate, lm_v_pred)**(1/2)
lm_v_rmse

505.7788871949105

In [13]:
# very similar performance on validate
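
The validate cells in these notes fit a second model on the validate split. A minimal sketch of the more common pattern, fitting once on train and reusing that fitted model to score validate, is shown here. It assumes synthetic data: the arrays, coefficients, and noise level are made up for illustration and are not the telco data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): y = 3x + 2 plus noise with sd 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X[:, 0] + 2 + rng.normal(0, 1, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=123)

# Fit once, on train only
model = LinearRegression().fit(X_tr, y_tr)

# Score the same fitted model on both splits
rmse_train = mean_squared_error(y_tr, model.predict(X_tr)) ** 0.5
rmse_validate = mean_squared_error(y_val, model.predict(X_val)) ** 0.5

print(f"train RMSE:    {rmse_train:.3f}")
print(f"validate RMSE: {rmse_validate:.3f}")
```

A validate RMSE close to the train RMSE under this setup suggests the model is not overfit, because validate played no part in fitting.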

### LassoLars()

In [14]:
# create the model, fit the model, use the model
lars = LassoLars(alpha=1)
lars.fit(X_strain, y_strain)
lars_pred = lars.predict(X_strain)

In [15]:
# Evaluate: compute root mean squared error
lars_rmse = mean_squared_error(y_strain, lars_pred)**(1/2)
lars_rmse

504.5981680260354

In [16]:
# 3rd best

**Validate**

In [17]:
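# create a second model, fit it on validate, and predict on validate
# note: this RMSE is in-sample for validate; lars.predict(X_svalidate) would test the train model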
lars_v = LassoLars(alpha=1)
lars_v.fit(X_svalidate, y_svalidate)
lars_v_pred = lars_v.predict(X_svalidate)

In [18]:
# Evaluate: compute root mean squared error
lars_v_rmse = mean_squared_error(y_svalidate, lars_v_pred)**(1/2)
lars_v_rmse

506.09783032159794

In [19]:
# very similar performance on validate

### Polynomial Features + Linear Regression()

In [20]:
# create the model, fit the model, use the model
# make the polynomial thing
pf = PolynomialFeatures(degree=2)

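# fit and transform the thing
# to get a new set of features: a constant column, the originals, their squares, and their pairwise products
# note: this overwrites X_strain, so every later cell that uses X_strain (including the TweedieRegressor cells) sees the polynomial features
X_strain = pf.fit_transform(X_strain)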

# feed that data into our linear model. 
# make the thing
lm_squared = LinearRegression()
lm_squared.fit(X_strain, y_strain)
lm_squared_pred = lm_squared.predict(X_strain)

In [21]:
# Evaluate: compute root mean squared error
lm_squared_rmse = mean_squared_error(y_strain, lm_squared_pred)**(1/2)
lm_squared_rmse

84.75294691352919

In [22]:
# this is the best of the 4 models, however, it may be overfit

**validate**

In [23]:
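# create the model, fit the model, use the model
# make the polynomial thing
# note: a new transformer and a new model are fit on validate here; to test the train model,
# reuse pf.transform(...) and lm_squared.predict(...) instead
pf_v = PolynomialFeatures(degree=2)

# fit and transform the thing
# to get a new set of features: a constant column, the originals, their squares, and their pairwise products
X_svalidate = pf_v.fit_transform(X_svalidate)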

# feed that data into our linear model. 
# make the thing
lm_v_squared = LinearRegression()
lm_v_squared.fit(X_svalidate, y_svalidate)
lm_v_squared_pred = lm_v_squared.predict(X_svalidate)

In [24]:
# Evaluate: compute root mean squared error
lm_v_squared_rmse = mean_squared_error(y_svalidate, lm_v_squared_pred)**(1/2)
lm_v_squared_rmse

79.88687866976178

In [25]:
# very similar performance on validate
# surprising that this performed well and did not overfit
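
To make the degree-2 expansion concrete, here is a small sketch on two made-up rows (the `_demo` arrays are hypothetical). It also shows the transformer being fit on train and then reused on validate with `transform`, rather than fit a second time.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical two-feature rows; call the columns a and b
X_train_demo = np.array([[2.0, 3.0], [1.0, 4.0]])
X_validate_demo = np.array([[5.0, 6.0]])

poly = PolynomialFeatures(degree=2)

# Learn the feature layout from train, then reuse it for validate
train_poly = poly.fit_transform(X_train_demo)
validate_poly = poly.transform(X_validate_demo)

# Column order: 1, a, b, a^2, a*b, b^2
print(train_poly[0].tolist())     # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]
print(validate_poly[0].tolist())  # [1.0, 5.0, 6.0, 25.0, 30.0, 36.0]
```

Two input columns become six, so the expansion adds a constant and the cross term `a*b` as well as the squares.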

### TweedieRegressor()

In [26]:
# create the model, fit the model, use the model
tr = TweedieRegressor()
tr.fit(X_strain, y_strain)
tr_pred = tr.predict(X_strain)

In [27]:
# Evaluate: compute root mean squared error
tr_rmse = mean_squared_error(y_strain, tr_pred)**(1/2)
tr_rmse

1895.9348095952935

In [28]:
# while this RMSE is an improvement from the Baseline it is the worst of the 4 models

**validate**

In [29]:
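# create a second model, fit it on validate, and predict on validate
# note: this RMSE is in-sample for validate; tr.predict(...) on validate would test the train model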
tr_v = TweedieRegressor()
tr_v.fit(X_svalidate, y_svalidate)
tr_v_pred = tr_v.predict(X_svalidate)

In [30]:
# Evaluate: compute root mean squared error
tr_v_rmse = mean_squared_error(y_svalidate, tr_v_pred)**(1/2)
tr_v_rmse

1883.1241715662945

In [31]:
# very similar performance on validate
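
One possible reason TweedieRegressor trails the other models is its default L2 penalty (`alpha=1.0`), which shrinks coefficients on features scaled to [0, 1]. The sketch below is a hedged illustration on synthetic data, not the telco data, and the `rmse` helper is a made-up name. It compares the default settings against `alpha=0`, where `power=0` should land close to ordinary least squares.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, TweedieRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (hypothetical): two features already scaled to [0, 1]
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(300, 2))
y = 2 + 3 * X[:, 0] - X[:, 1] + rng.normal(0, 0.1, size=300)

def rmse(model):
    """Fit on the demo data and return the in-sample RMSE."""
    return mean_squared_error(y, model.fit(X, y).predict(X)) ** 0.5

rmse_ols = rmse(LinearRegression())
rmse_default = rmse(TweedieRegressor())                      # power=0, alpha=1.0
rmse_unpenalized = rmse(TweedieRegressor(power=0, alpha=0))  # no penalty

print(f"LinearRegression:          {rmse_ols:.3f}")
print(f"TweedieRegressor default:  {rmse_default:.3f}")
print(f"TweedieRegressor alpha=0:  {rmse_unpenalized:.3f}")
```

If the same pattern holds on the real data, trying a smaller `alpha` would be a reasonable single-hyperparameter change before ruling the model out.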

# Same as above with a different dataset

In [32]:
tips = data('tips')
tips.head()

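| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 2 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 3 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 4 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 5 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |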


In [33]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 1 to 244
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 15.2+ KB


In [34]:
# it makes sense to convert object type columns to numeric values at this point
# create a mask to identify the object columns
mask = np.array(tips.dtypes == 'object')
# create a df using the mask
objdf = tips.iloc[:, mask]
# get dummies
dummy_df = pd.get_dummies(objdf, dummy_na=False, drop_first=True)
# put the dummies with the original
df = pd.concat([tips, dummy_df], axis=1)
# drop the columns from the original we now have dummies for
df.drop(columns=objdf.columns, inplace=True)

In [35]:
df.head()

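| | total_bill | tip | size | sex_Male | smoker_Yes | day_Sat | day_Sun | day_Thur | time_Lunch |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 10.34 | 1.66 | 3 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 21.01 | 3.5 | 3 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 23.68 | 3.31 | 2 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5 | 24.59 | 3.61 | 4 | 0 | 0 | 0 | 1 | 0 | 0 |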


In [36]:
df = df.rename(columns={'size': 'party_size'})

In [37]:
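# create some additional features
# note: percentage_tip is computed from tip (the target), so keeping it in X leaks the target into the features
df['price_person'] = df.total_bill / df.party_size
df['percentage_tip'] = df.tip / df.total_bill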
df.head()

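| | total_bill | tip | party_size | sex_Male | smoker_Yes | day_Sat | day_Sun | day_Thur | time_Lunch | price_person | percentage_tip |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 16.99 | 1.01 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 8.495 | 0.059447 |
| 2 | 10.34 | 1.66 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 3.446667 | 0.160542 |
| 3 | 21.01 | 3.5 | 3 | 1 | 0 | 0 | 1 | 0 | 0 | 7.003333 | 0.166587 |
| 4 | 23.68 | 3.31 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 11.84 | 0.13978 |
| 5 | 24.59 | 3.61 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 6.1475 | 0.146808 |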


In [38]:
from sklearn.model_selection import train_test_split
# now split the data into train, validate, test
train_validate, test = train_test_split(df, test_size=.2, random_state=123)
train, validate = train_test_split(train_validate, test_size=.3, random_state=123)
train.shape, validate.shape, test.shape

((136, 11), (59, 11), (49, 11))
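
The split sizes above follow from simple arithmetic on the 244 rows. A small sketch, assuming scikit-learn rounds a fractional `test_size` up to the next whole row (which matches the shapes shown); `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test share up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test

# First split: 20% of all rows go to test
n_train_validate, n_test = split_sizes(244, 0.2)

# Second split: 30% of the remaining rows go to validate
n_train, n_validate = split_sizes(n_train_validate, 0.3)

print(n_train, n_validate, n_test)  # 136 59 49
```

Because the second split takes 30% of the remaining 195 rows, validate ends up at about 24% of the full dataset and train at about 56%.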

In [39]:
# separate into X and y datasets
X_train = train.drop(columns=['tip'])
X_validate = validate.drop(columns=['tip'])
X_test = test.drop(columns=['tip'])

y_train = train[['tip']]
y_validate = validate[['tip']]
y_test = test[['tip']]

In [40]:
# now scale X_train dataset
from sklearn.preprocessing import MinMaxScaler
# scaling data, not sure MinMaxScaler is the best one to use here, but proceeding with this one to save time
scaler = MinMaxScaler(copy=True).fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)
# note this returns X_train_scaled as an array
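
Fitting the scaler on train only means validate and test are scaled with the train minimum and maximum. A minimal sketch with made-up one-column data (the `_demo` names are hypothetical) shows the consequence: values outside the train range land outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-column data
X_train_demo = np.array([[1.0], [3.0], [5.0]])
X_validate_demo = np.array([[0.0], [6.0]])

# The scaler learns min=1 and max=5 from train only
scaler_demo = MinMaxScaler().fit(X_train_demo)

train_scaled = scaler_demo.transform(X_train_demo)        # (x - 1) / (5 - 1)
validate_scaled = scaler_demo.transform(X_validate_demo)  # same formula applied to unseen values

print(train_scaled.ravel().tolist())     # [0.0, 0.5, 1.0]
print(validate_scaled.ravel().tolist())  # [-0.25, 1.25]
```

This is expected behavior rather than a bug: refitting the scaler on validate would hide the fact that those rows fall outside what the model saw in training.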

In [41]:
# convert the scaled arrays back to dataframes, keeping the original column names and index
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns, index=X_train.index)
X_validate_scaled = pd.DataFrame(X_validate_scaled, columns=X_validate.columns, index=X_validate.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns, index=X_test.index)

### Baseline

In [42]:
# 1st find the mean and/or median of target variable
#np.mean(y_train) #2.947
np.median(y_train) #2.68
# will use median due to outliers

2.68

In [43]:
# now find the root mean squared error using the median as the predicted value for all observations

# np.full creates an array of the specified length and fills every element with the given value
# len(y_train) # there are 136 observations in y_train

baseline_rmse = mean_squared_error(y_train, np.full(len(y_train), np.median(y_train)))**(1/2)
print('Baseline RMSE:', baseline_rmse)

Baseline RMSE: 1.475600326487295


### Linear Regression()

In [44]:
# create the model, fit the model, use the model
lmt = LinearRegression(normalize=True)
lmt.fit(X_train_scaled, y_train)
lmt_pred = lmt.predict(X_train_scaled)

In [45]:
# Evaluate: compute root mean squared error
lmt_rmse = mean_squared_error(y_train, lmt_pred)**(1/2)
lmt_rmse

0.5574411659250524

In [46]:
# this looks like a pretty good model

**Validate**

In [47]:
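# create a second model, fit it on validate, and predict on validate
# note: this RMSE is in-sample for validate; lmt.predict(X_validate_scaled) would test the train model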
lmt_v = LinearRegression(normalize=True)
lmt_v.fit(X_validate_scaled, y_validate)
lmt_v_pred = lmt_v.predict(X_validate_scaled)

In [48]:
# Evaluate: compute root mean squared error
lmt_v_rmse = mean_squared_error(y_validate, lmt_v_pred)**(1/2)
lmt_v_rmse

0.43574645795053957

In [49]:
# this is better than the train result

### LassoLars()

In [50]:
# create the model, fit the model, use the model
larst = LassoLars(alpha=1)
larst.fit(X_train_scaled, y_train)
larst_pred = larst.predict(X_train_scaled)

In [51]:
# Evaluate: compute root mean squared error
larst_rmse = mean_squared_error(y_train, larst_pred)**(1/2)
larst_rmse

1.4512460770849047

In [52]:
# this is barely better than the baseline of 1.475

**Validate**

In [53]:
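# create a second model, fit it on validate, and predict on validate
# note: this RMSE is in-sample for validate; larst.predict(X_validate_scaled) would test the train model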
larst_v = LassoLars(alpha=1)
larst_v.fit(X_validate_scaled, y_validate)
larst_v_pred = larst_v.predict(X_validate_scaled)

In [54]:
# Evaluate: compute root mean squared error
larst_v_rmse = mean_squared_error(y_validate, larst_v_pred)**(1/2)
larst_v_rmse

1.4971893422202531

In [55]:
# this is worse than the baseline
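
A likely explanation for the near-baseline RMSE is that `alpha=1` is strong enough to shrink every coefficient to zero when the features are scaled to [0, 1], which leaves a model that predicts the mean for every row. The sketch below illustrates that effect on synthetic data, not the tips data; the variable names are made up.

```python
import numpy as np
from sklearn.linear_model import LassoLars

# Synthetic stand-in data (hypothetical): one feature scaled to [0, 1]
rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(200, 1))
y = 2 + 3 * X[:, 0] + rng.normal(0, 0.2, size=200)

strong = LassoLars(alpha=1).fit(X, y)    # same alpha as the cells above
weak = LassoLars(alpha=0.01).fit(X, y)   # much lighter penalty

print("alpha=1    coef:", strong.coef_)  # shrunk all the way to zero
print("alpha=0.01 coef:", weak.coef_)

# With every coefficient at zero, the model predicts the training mean for all rows
predicts_mean = np.allclose(strong.predict(X), y.mean())
print(predicts_mean)
```

Checking `larst.coef_` on the real model would confirm or rule this out; if the coefficients are all zero, a smaller `alpha` is the obvious next thing to try.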

### Polynomial Features + Linear Regression()

In [56]:
# create the model, fit the model, use the model
# make the polynomial thing
pf = PolynomialFeatures(degree=2)

In [57]:
# fit and transform the thing
# to get a new set of features: a constant column, the originals, their squares, and their pairwise products
X_train_scaledp = pf.fit_transform(X_train_scaled)

In [58]:
# feed that data into our linear model. 
# make the thing
lmt_squared = LinearRegression()
lmt_squared.fit(X_train_scaledp, y_train)
lmt_squared_pred = lmt_squared.predict(X_train_scaledp)

In [59]:
# Evaluate: compute root mean squared error
lmt_squared_rmse = mean_squared_error(y_train, lmt_squared_pred)**(1/2)
lmt_squared_rmse

4.320356304405854e-15

In [60]:
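# this error is effectively zero (floating-point noise), which points to target leakage rather than overfitting:
# the degree-2 features can reproduce total_bill * percentage_tip, which equals tip exactly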

**validate**

In [61]:
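# create the model, fit the model, use the model
# make the polynomial thing
# note: a new transformer and a new model are fit on validate here; to test the train model,
# reuse pf.transform(...) and lmt_squared.predict(...) instead
pf_v = PolynomialFeatures(degree=2)

# fit and transform the thing
# to get a new set of features: a constant column, the originals, their squares, and their pairwise products
X_validate_scaledp = pf_v.fit_transform(X_validate_scaled)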

In [62]:
# feed that data into our linear model. 
# make the thing
lmt_v_squared = LinearRegression()
lmt_v_squared.fit(X_validate_scaledp, y_validate)
lmt_v_squared_pred = lmt_v_squared.predict(X_validate_scaledp)

In [63]:
# Evaluate: compute root mean squared error
lmt_v_squared_rmse = mean_squared_error(y_validate, lmt_v_squared_pred)**(1/2)
lmt_v_squared_rmse

3.2581221974017203e-15

In [64]:
# also effectively zero; at this scale (1e-15) the difference from the train RMSE is floating-point noise

**test**

In [66]:
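# create the model, fit the model, use the model
# make the polynomial thing
# note: a new transformer and a new model are fit on test here; to test the train model,
# reuse pf.transform(...) and lmt_squared.predict(...) instead
pf_t = PolynomialFeatures(degree=2)

# fit and transform the thing
# to get a new set of features: a constant column, the originals, their squares, and their pairwise products
X_test_scaledp = pf_t.fit_transform(X_test_scaled)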

# feed that data into our linear model. 
# make the thing
lmt_t_squared = LinearRegression()
lmt_t_squared.fit(X_test_scaledp, y_test)
lmt_t_squared_pred = lmt_t_squared.predict(X_test_scaledp)

In [67]:
# Evaluate: compute root mean squared error
lmt_t_squared_rmse = mean_squared_error(y_test, lmt_t_squared_pred)**(1/2)
lmt_t_squared_rmse

8.59770276642386e-15

In [None]:
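# still effectively zero (1e-15); the differences between train, validate, and test are floating-point noise
# this model ranks first only because total_bill * percentage_tip reproduces tip exactly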
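
An RMSE on the order of 1e-15 usually means the features contain the target. The sketch below reproduces the effect on synthetic data, assuming a feature built as `tip / total_bill`; the values and the `poly_rmse` helper are made up for illustration, and this is not the tips dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# Synthetic stand-in for the tips data (hypothetical values)
rng = np.random.default_rng(1)
total_bill = rng.uniform(5, 50, size=150)
tip = total_bill * rng.uniform(0.05, 0.30, size=150)
percentage_tip = tip / total_bill  # derived from the target

def poly_rmse(features):
    """Scale, expand to degree 2, fit on the first 100 rows, score on the last 50."""
    scaler = MinMaxScaler().fit(features[:100])
    poly = PolynomialFeatures(degree=2)
    X_fit = poly.fit_transform(scaler.transform(features[:100]))
    X_holdout = poly.transform(scaler.transform(features[100:]))
    model = LinearRegression().fit(X_fit, tip[:100])
    return mean_squared_error(tip[100:], model.predict(X_holdout)) ** 0.5

rmse_with_leak = poly_rmse(np.column_stack([total_bill, percentage_tip]))
rmse_without_leak = poly_rmse(total_bill.reshape(-1, 1))

print(f"with percentage_tip:    {rmse_with_leak:.2e}")
print(f"without percentage_tip: {rmse_without_leak:.2e}")
```

Dropping `percentage_tip` from the feature set before modeling would give an RMSE that reflects what the model can actually predict.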

### TweedieRegressor()

In [None]:
# create the model, fit the model, use the model
trt = TweedieRegressor()
trt.fit(X_train_scaled, y_train)
trt_pred = trt.predict(X_train_scaled)

In [None]:
# Evaluate: compute root mean squared error
trt_rmse = mean_squared_error(y_train, trt_pred)**(1/2)
trt_rmse

In [None]:
# this is not much better than the baseline

**validate**

In [None]:
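# create a second model, fit it on validate, and predict on validate
# note: this RMSE is in-sample for validate; trt.predict(X_validate_scaled) would test the train model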
trt_v = TweedieRegressor()
trt_v.fit(X_validate_scaled, y_validate)
trt_v_pred = trt_v.predict(X_validate_scaled)

In [None]:
# Evaluate: compute root mean squared error
trt_v_rmse = mean_squared_error(y_validate, trt_v_pred)**(1/2)
trt_v_rmse

In [None]:
# this is barely lower than the baseline of 1.475

# Model Exercises - in curriculum
Using the data on student grades from this lesson, complete the following:

1. Split the data into train, validate, and test datasets.
2. Create a model that uses exam 1 to predict the final grade.
3. Create a model that uses exam 2 to predict the final grade.
4. Compare your models in the following manner:
    - Calculate the mean squared error
    - Visualize the residuals. Create a separate visualization for each model.
    - Visualize the actual vs the predicted values. Create a separate visualization for each model.
    - Bonus: Combine the separate visualizations for each model into a single visualization. Is this visual helpful?
5. Create a model that uses exam 1 and exam 3 to predict final grade. How does this model compare to your previous ones?
6. Take your best performing model and measure its performance on the validate data set. How does the performance differ between train and validate?
7. Make a 4th model with a slight difference like one more/less feature or a single hyperparameter that's different to see if you can beat the last model's performance on validate.
8. Tune your models using validate to improve performance. Select the model w/ the best performance and evaluate that one on test, to get a more clear understanding of how it will perform on out-of-sample data.

**Our scenario continues:**

As a customer analyst, I want to know who has spent the most money with us over their lifetime. I have monthly charges and tenure, so I think I will be able to use those two attributes as features to estimate total_charges. I need to do this within an average of $5.00 per customer.

1. Run all your previous scripts that acquired, prepared, split, and scaled the telco churn data.
2. Fit 3 different linear models to your data, one with just tenure, one with just monthly_charges, and one with both.
3. Evaluate the models and your baseline.
4. Select the model that performed the best, and evaluate it with your validate data.
5. Make a 4th model with a slight difference like one more feature or a single hyperparameter that's different to see if you can beat the last model's performance on validate.
6. Tune your models using validate to improve performance. Select the model w/ the best performance and evaluate that one on test, to get a more clear understanding of how it will perform on out-of-sample data.
