In [None]:
# Setup chunk to install and load required packages
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
suppressWarnings(if(!require("pacman")) install.packages("pacman"))

pacman::p_load('tidyverse', 'tidymodels', 'glmnet',
               'randomForest', 'xgboost','patchwork',
               'paletteer', 'here', 'doParallel', 'summarytools')


## Regression - Experiment with more powerful regression models

In the previous notebook, you used simple regression models to look at the relationship between features of a bike rentals dataset. In this notebook, you'll experiment with more complex models to improve your regression performance.

Let's load the bicycle-sharing data as a tibble and view the first few rows. You'll also split the data into training and test datasets.


In [None]:
# Load the required packages and make them available in your current R session
suppressPackageStartupMessages({
  library(tidyverse)
  library(tidymodels)
  library(lubridate)
  library(paletteer)
})

# Import the data into the R session
bike_data <- read_csv(file = "https://raw.githubusercontent.com/MicrosoftDocs/ml-basics/master/data/daily-bike-share.csv", show_col_types = FALSE)

# Parse dates then extract days
bike_data <- bike_data %>%
  mutate(dteday = mdy(dteday)) %>% 
  mutate(day = day(dteday))

# Select desired features and labels
bike_select <- bike_data %>% 
  select(c(season, mnth, holiday, weekday, workingday, weathersit,
           temp, atemp, hum, windspeed, rentals)) %>% 
  mutate(across(1:6, factor))

# Split 70% of the data for training and the rest for testing
set.seed(2056)
bike_split <- bike_select %>% 
  initial_split(prop = 0.7,
                # Stratify on holiday so both splits keep a similar share of holidays
                strata = holiday)

# Extract the data in each split
bike_train <- training(bike_split)
bike_test <- testing(bike_split)

# Specify multiple regression metrics
eval_metrics <- metric_set(rmse, rsq)


cat("Training Set", nrow(bike_train), "rows",
    "\nTest Set", nrow(bike_test), "rows")


The result is two datasets:

-   **bike_train**: A subset of the dataset used to train the model.
-   **bike_test**: A subset of the dataset held back to evaluate the trained model.

Now you're ready to train a model by fitting a suitable regression algorithm to the training data.
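
Before comparing models, it may help to see what the two metrics bundled in `eval_metrics` measure. The sketch below uses made-up numbers (the `actual` and `predicted` vectors are purely illustrative) and compares hand-computed values with yardstick's `rmse_vec()` and `rsq_vec()`. In yardstick, `rsq` is the squared correlation between the true and predicted values.

```r
library(yardstick)

# Made-up values, for illustration only
actual    <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 37)

# RMSE: square root of the mean squared difference
manual_rmse <- sqrt(mean((actual - predicted)^2))

# R-squared as yardstick defines it: squared correlation
manual_rsq <- cor(actual, predicted)^2

# The same metric set used in this notebook, applied to the toy data
eval_metrics <- metric_set(rmse, rsq)
toy <- data.frame(actual, predicted)
eval_metrics(toy, truth = actual, estimate = predicted)
```

A lower RMSE and a higher R-squared both indicate predictions that track the actual values more closely.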

### Experiment with algorithms

The linear regression algorithm you used last time to train the model has some predictive capability, but there are other kinds of regression algorithms you could try:

-   **Linear algorithms**: Not just the linear regression algorithm you used, which is technically an ordinary least squares algorithm, but other variants such as lasso and ridge. Lasso is an acronym for least absolute shrinkage and selection operator.
-   **Tree-based algorithms**: Algorithms that build a decision tree to reach a prediction.
-   **Ensemble algorithms**: Algorithms that combine the outputs of multiple base algorithms to improve generalizability.

For a full list of parsnip model types and engines, see [model types and engines](https://www.tidymodels.org/find/parsnip/#models). You can also explore the corresponding [model arguments](https://www.tidymodels.org/find/parsnip/#model-args).
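
You can also query the available engines from within R. As a small sketch, parsnip's `show_engines()` lists the engines registered for a model type. The exact rows you see depend on your installed parsnip version and any extension packages you've loaded.

```r
library(parsnip)

# Engines registered for each model type used in this notebook
linear_engines <- show_engines("linear_reg")
tree_engines   <- show_engines("decision_tree")
forest_engines <- show_engines("rand_forest")
boost_engines  <- show_engines("boost_tree")

linear_engines
```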

### Try another linear algorithm

Let's try training the regression model by using a lasso algorithm. In Tidymodels, you only need to change the model specification; the rest of the code stays largely the same.

Here, you'll set up a model specification for lasso regression. The `penalty` argument sets the amount of regularization; the value used here (`penalty = 1`) is simply a starting guess rather than a tuned value. Setting `mixture = 1` specifies a pure lasso model.
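
For reference, the glmnet documentation states the objective that the engine minimizes for a numeric outcome. In parsnip, `penalty` plays the role of $\lambda$ and `mixture` plays the role of $\alpha$. The symbols $N$ (number of training rows), $x_i$, $y_i$, $\beta_0$, and $\beta$ are notation introduced here, not names from this notebook.

$$
\min_{\beta_0,\,\beta}\; \frac{1}{2N}\sum_{i=1}^{N}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 \;+\; \lambda\left[\frac{1-\alpha}{2}\,\lVert\beta\rVert_2^2 + \alpha\,\lVert\beta\rVert_1\right]
$$

With $\alpha = 1$ only the $\lVert\beta\rVert_1$ term remains, which is the lasso penalty; with $\alpha = 0$ only the squared term remains, which is the ridge penalty.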

You'll also make a model specification in a more succinct way than you did last time.
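
As a sketch of the two styles, the block below builds the same lasso specification twice: once with piped `set_engine()` and `set_mode()` calls (assumed here to be the style used last time) and once with the engine and mode passed as arguments. The object names `piped_spec` and `succinct_spec` are illustrative.

```r
library(tidymodels)

# Piped style: start from the model type, then add the engine and mode
piped_spec <- linear_reg(penalty = 1, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("regression")

# Succinct style: pass everything as arguments to the model function
succinct_spec <- linear_reg(
  engine = "glmnet",
  mode = "regression",
  penalty = 1,
  mixture = 1)

# Both specifications name the same engine and mode
c(piped_spec$engine, succinct_spec$engine)
c(piped_spec$mode, succinct_spec$mode)
```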


In [None]:
# Build a lasso model specification
lasso_spec <- linear_reg(
  engine = "glmnet",
  mode = "regression",
  penalty = 1,
  mixture = 1)

# Train a lasso regression model
lasso_mod <- lasso_spec %>% 
  fit(rentals ~ ., data = bike_train)

# Make predictions for test data
results <- bike_test %>% 
  bind_cols(lasso_mod %>% predict(new_data = bike_test) %>% 
              rename(predictions = .pred))

# Evaluate the model
lasso_metrics <- eval_metrics(data = results,
                              truth = rentals,
                              estimate = predictions)


# Plot predicted vs actual
theme_set(theme_light())
lasso_plt <- results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(size = 1.6, color = 'darkorchid') +
  # overlay regression line
  geom_smooth(method = 'lm', color = 'black', se = FALSE) +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

# Return evaluations
list(lasso_metrics, lasso_plt)
  


There's not much of an improvement. To improve the performance metrics, you can estimate a better value for the regularization hyperparameter `penalty`. You do this by resampling and tuning the model, which a later exercise covers.
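
As a rough preview only, the sketch below shows one way such tuning might look. To stay runnable on its own it uses the built-in `mtcars` data rather than the bike data, and the 5 folds and 10 candidate penalties are arbitrary choices for illustration, not recommendations.

```r
library(tidymodels)

set.seed(2056)

# Mark penalty as a value to be tuned instead of fixing it
lasso_tune_spec <- linear_reg(
  engine = "glmnet",
  mode = "regression",
  penalty = tune(),
  mixture = 1)

# Resamples of the (stand-in) training data
folds <- vfold_cv(mtcars, v = 5)

# A regular grid of candidate penalty values
penalty_grid <- grid_regular(penalty(), levels = 10)

# Fit and evaluate the model for every candidate on every fold
tuned <- tune_grid(
  lasso_tune_spec,
  mpg ~ .,
  resamples = folds,
  grid = penalty_grid,
  metrics = metric_set(rmse))

# The candidate with the lowest cross-validated RMSE
best_penalty <- select_best(tuned, metric = "rmse")
best_penalty
```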

### Try a decision tree algorithm

As an alternative to a linear model, a category of algorithms for machine learning uses a tree-based approach. The features in the dataset are examined in a series of evaluations. Each evaluation results in a branch in a decision tree based on the feature value. At the end of each series of branches are leaf nodes with the predicted label value based on the feature values.
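
To make a single "evaluation" concrete, the sketch below picks one split for a toy regression problem in base R. It tries each candidate threshold on one feature and keeps the one that minimizes the total squared error of the two resulting groups, which is the usual criterion for regression trees. The data values and the helper `sse()` are made up for illustration, and real implementations such as rpart add stopping rules and pruning on top of this idea.

```r
# Toy data: one numeric feature and a numeric label (made-up values)
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 7, 20, 21, 22)

# Sum of squared errors of a group around its own mean
sse <- function(v) sum((v - mean(v))^2)

# Candidate thresholds: midpoints between consecutive sorted feature values
xs <- sort(unique(x))
thresholds <- (head(xs, -1) + tail(xs, -1)) / 2

# Total SSE of the two groups created by each threshold
split_sse <- sapply(thresholds, function(t) sse(y[x < t]) + sse(y[x >= t]))

# The best split is the threshold with the lowest total SSE
best_threshold <- thresholds[which.min(split_sse)]
best_threshold                                   # 6.5

# Each leaf predicts the mean label of the rows that reach it
left_leaf  <- mean(y[x < best_threshold])        # 6
right_leaf <- mean(y[x >= best_threshold])       # 21
```

A full tree repeats this search inside each resulting group, across all features, until a stopping rule is met.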

It's easiest to see how this process works with an example. Let's train a decision tree regression model by using the bike rental data. After you train the model, the following code prints the model definition and a text representation of the tree it uses to predict label values.


In [None]:
# Build a decision tree specification
tree_spec <- decision_tree(
  engine = "rpart",
  mode = "regression")

# Train a decision tree model 
tree_mod <- tree_spec %>% 
  fit(rentals ~ ., data = bike_train)

# Print model
tree_mod


Now you have a tree-based model. But is it any good? Let's evaluate it with the test data. You'll also try out a new function, `augment()`.

The `augment()` function generates model predictions and adds them as new columns to the data you pass in.
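
As a small sketch of the equivalence, the block below compares `augment()` with the `predict()` plus `bind_cols()` pattern used for the lasso model. It uses the built-in `mtcars` data and a plain linear model so that it runs on its own; the object names are illustrative.

```r
library(tidymodels)

# A simple stand-in model fitted on built-in data
toy_mod <- linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

# Two-step approach: predict, then bind the predictions to the data
manual <- mtcars %>%
  bind_cols(predict(toy_mod, new_data = mtcars))

# One-step approach: augment() predicts and adds the columns itself
augmented <- augment(toy_mod, new_data = mtcars)

# Both approaches produce the same .pred values
all.equal(manual$.pred, augmented$.pred)
```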


In [None]:
# Make predictions and bind them to the test data
results <- tree_mod %>% 
  augment(new_data = bike_test) %>% 
  rename(predictions = .pred)

# Evaluate the model
tree_metrics <- eval_metrics(data = results,
                             truth = rentals,
                             estimate = predictions)

# Plot predicted vs actual
tree_plt <- results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(color = 'tomato') +
  # overlay regression line
  geom_smooth(method = 'lm', color = 'steelblue', se = FALSE) +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

# Return evaluations
list(tree_metrics, tree_plt)


The tree-based model doesn't seem to offer much improvement over the linear model. You can also see that it predicts a constant value across each range of predictor values, because every row that reaches the same leaf gets the same prediction. What else can you try?

### Try an ensemble algorithm

Ensemble algorithms work by combining multiple base estimators to produce a better overall model. They do this in one of two ways: by applying an aggregate function to a collection of base models, which is known as bagging, or by building a sequence of models that build on one another to improve predictive performance, which is known as boosting.

For example, let's try a random forest model. It applies an averaging function to multiple decision tree models for a better overall model.
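
The averaging idea can be sketched by hand. The block below is a minimal bagging example, not the randomForest algorithm itself: it fits rpart trees on bootstrap samples of the built-in `mtcars` data and averages their predictions. A random forest additionally samples a subset of features at each split, which this sketch leaves out, and the choice of 25 trees is arbitrary.

```r
library(rpart)

set.seed(2056)
n_trees <- 25

# Fit one tree per bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(nrow(mtcars), replace = TRUE)
  rpart(mpg ~ ., data = mtcars[boot_rows, ])
})

# One column of predictions per tree
tree_preds <- sapply(trees, predict, newdata = mtcars)

# The ensemble prediction is the row-wise average across trees
bagged_pred <- rowMeans(tree_preds)
head(bagged_pred)
```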


In [None]:
# For reproducibility
set.seed(2056)

# Build a random forest model specification
rf_spec <- rand_forest() %>% 
  set_engine('randomForest') %>% 
  set_mode('regression')

# Train a random forest model 
rf_mod <- rf_spec %>% 
  fit(rentals ~ ., data = bike_train)

# Print model
rf_mod


Now you have a random forest model. But is it any good? Let's evaluate it with the test data.



In [None]:
# Make predictions and bind them to the test data
results <- rf_mod %>% 
  augment(new_data = bike_test) %>% 
  rename(predictions = .pred)


# Evaluate the model
rf_metrics <- eval_metrics(data = results,
                           truth = rentals,
                           estimate = predictions)


# Plot predicted vs actual
rf_plt <- results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(color = '#6CBE50FF') +
  # overlay regression line
  geom_smooth(method = 'lm', color = '#2B7FF9FF', se = FALSE) +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

# Return evaluations
list(rf_metrics, rf_plt)


That's a step in the right direction.

Let's also try a boosting ensemble algorithm. You'll use a gradient boosting estimator. Like a random forest algorithm, it's based on multiple trees. Instead of building them all independently and taking the average result, each tree is built on the outputs of the previous one. This technique is an attempt to incrementally reduce the loss (error) in the model.
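
To illustrate the sequential idea, here is a minimal hand-rolled sketch of boosting for squared-error loss, where each new tree is fitted to the residuals left by the trees before it. It uses one-split rpart trees on the built-in `mtcars` data; the learning rate of 0.3 and the 20 rounds are arbitrary illustrative values. XGBoost is considerably more sophisticated (regularization, second-order gradient information, and more), so treat this as the core idea only.

```r
library(rpart)

learn_rate <- 0.3
n_rounds <- 20

# Predictors only; the label is handled separately
features <- mtcars[, setdiff(names(mtcars), "mpg")]

# Start from a constant prediction: the mean of the label
pred <- rep(mean(mtcars$mpg), nrow(mtcars))
initial_rmse <- sqrt(mean((mtcars$mpg - pred)^2))

for (i in seq_len(n_rounds)) {
  # Residuals are what the current ensemble still gets wrong
  train_df <- cbind(features, resid = mtcars$mpg - pred)

  # Fit a one-split tree to the residuals
  stump <- rpart(resid ~ ., data = train_df,
                 control = rpart.control(maxdepth = 1))

  # Add a damped version of the tree's output to the running prediction
  pred <- pred + learn_rate * predict(stump, newdata = features)
}

final_rmse <- sqrt(mean((mtcars$mpg - pred)^2))
c(initial = initial_rmse, final = final_rmse)
```

Note that the error here is measured on the training rows, so it only shows the loss shrinking round by round; it says nothing about performance on unseen data.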

Here, you'll fit a gradient boosting model by using the `xgboost` engine.


In [None]:
# For reproducibility
set.seed(2056)

# Build an xgboost model specification
boost_spec <- boost_tree() %>% 
  set_engine('xgboost') %>% 
  set_mode('regression')

# Train an xgboost model 
boost_mod <- boost_spec %>% 
  fit(rentals ~ ., data = bike_train)


# Print model
boost_mod


From the output, you can see that a gradient boosting machine combines a series of base models, each created sequentially and dependent on the previous ones.

Now you have an XGBoost model. But is it any good? Again, let's evaluate it with the test data.


In [None]:
# Make predictions and bind them to the test data
results <- boost_mod %>% 
  augment(new_data = bike_test) %>% 
  rename(predictions = .pred)

# Evaluate the model
boost_metrics <- eval_metrics(data = results,
                              truth = rentals,
                              estimate = predictions)

# Plot predicted vs actual
boost_plt <- results %>% 
  ggplot(mapping = aes(x = rentals, y = predictions)) +
  geom_point(color = '#4D3161FF') +
  # overlay regression line
  geom_smooth(method = 'lm', color = 'black', se = FALSE) +
  ggtitle("Daily Bike Share Predictions") +
  xlab("Actual Labels") +
  ylab("Predicted Labels") +
  theme(plot.title = element_text(hjust = 0.5))

# Return evaluations
list(boost_metrics, boost_plt)


You're definitely getting somewhere. Can you do better?

### Summary

You've tried several new regression algorithms to improve performance. In the next exercise unit, you'll look at tuning these algorithms to improve it further. Then you'll take a look at data preprocessing and model hyperparameters.

### Further reading

To learn more about Tidymodels, see the [Tidymodels documentation](https://www.tidymodels.org/).
