<a href="https://colab.research.google.com/github/Pattiecodes/DC_DS-in-R/blob/main/Module_4_Feature_Engineering_in_R_R.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# --- Module Start ---



# A tentative model
You are handed a data set with measures of the gravitational force between two bodies at different distances and are challenged to build a simple model to predict the force at a specific distance. Initially, you want to stick to simple linear regression. The data consist of 120 pairs of distance and force measurements and are loaded for you as newton.

Instructions
100 XP
Build a linear model for the newton data using the base R linear model function, lm(), and assign it to lr_force.
Create a new data frame df by binding the prediction values to the original newton data.
Generate a scatterplot of force versus distance using ggplot().
Add a regression line to the scatterplot with the fitted values.

In [None]:
# Load the packages that provide %>%, bind_cols(), and ggplot()
library(dplyr)
library(ggplot2)

# Build a linear model for the newton data and assign it to lr_force
lr_force <- lm(force ~ distance, data = newton)

# Create a new data frame by binding the prediction values to the original data
df <- newton %>% bind_cols(lr_pred = predict(lr_force))

# Generate a scatterplot of force vs. distance
df %>%
  ggplot(aes(x = distance, y = force)) +
  geom_point() +
# Add a regression line with the fitted values
  geom_line(aes(y = lr_pred), color = "blue", lwd = .75) +
  ggtitle("Linear regression of force vs. distance") +
  theme_classic()


# Manually engineering a feature
After doing some research with your team, you recall that the gravitational force of attraction between two bodies obeys Newton's formula:


$$
F = G \frac{m_1 m_2}{r^2}.
$$

You can't use the formula directly because the masses are unknown, but you can fit a regression model of force as a function of inv_square_distance. The augmented dataset df you built in the previous exercise has been loaded for you.
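
To spell out why this works: for a fixed pair of bodies, the product $G m_1 m_2$ is an unknown constant, so Newton's formula is linear in the transformed predictor $x = 1/r^2$ (the inv_square_distance feature). The intercept $\beta_0$ and the error term $\varepsilon$ below are modeling assumptions added for the regression, not part of the physical law:

$$
F = G m_1 m_2 \cdot \frac{1}{r^2} = \beta_1 x
\quad\Longrightarrow\quad
F = \beta_0 + \beta_1 x + \varepsilon, \qquad x = \frac{1}{r^2}, \quad \beta_1 = G m_1 m_2 .
$$

If the data follow the formula, the estimated intercept should be close to zero, and the slope estimates the unknown product of the constant and the two masses.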

Instructions
100 XP
Create a new variable inv_square_distance defined as the reciprocal of the squared distance and add it to the df data frame.
Build a simple regression model using lm() of force versus inv_square_distance and save it as lr_force_2.
Bind your predictions to df_inverse.

In [None]:
# Create a new variable inv_square_distance
df_inverse <- df %>% mutate(inv_square_distance = 1/distance^2)

# Build a simple regression model
lr_force_2 <- lm(force ~ inv_square_distance, data = df_inverse)

# Bind your predictions to df_inverse
df_inverse <- df_inverse %>% bind_cols(lr2_pred = predict(lr_force_2))

df_inverse %>% ggplot(aes(x = distance, y = force)) +
  geom_point() +
  geom_line(aes(y = lr2_pred), col = "blue", lwd = .75) +
  ggtitle("Linear regression of force vs. inv_square_distance") +
  theme_classic()
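
To see why the engineered feature helps, here is a minimal base R sketch on simulated data. The newton data itself is not reproduced here, so the constant 50, the noise level, and the distance range are arbitrary assumptions chosen only for illustration.

```r
set.seed(1)

# Simulated stand-in for newton: force follows k / distance^2 plus small noise.
# The constant k = 50 and the noise level are arbitrary assumptions.
distance <- seq(1, 10, length.out = 120)
force <- 50 / distance^2 + rnorm(120, sd = 0.5)
sim <- data.frame(distance, force)

# Raw predictor versus engineered predictor
fit_raw <- lm(force ~ distance, data = sim)
fit_inv <- lm(force ~ I(1 / distance^2), data = sim)

# Share of variance explained by each model
c(raw = summary(fit_raw)$r.squared,
  engineered = summary(fit_inv)$r.squared)
```

Both models are still linear regressions; only the predictor changes, which is the point of manual feature engineering.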

# Setting up your data for analysis
You will look at a version of the nycflights13 dataset, loaded as flights. It contains information on flights departing from New York City. You are interested in predicting whether or not they will arrive late to their destination, but first, you need to set up the data for analysis.

After discussing your model goals with a team of experts, you selected the following variables for your model: flight, sched_dep_time, dep_delay, sched_arr_time, carrier, origin, dest, distance, date, arrival.

You will also mutate() the date using as.Date() and convert character type variables to factors.

Lastly, you will split the data into train and test datasets.

Instructions
100 XP
Transform all character-type variables to factors.
Split the flights data into test and train sets.

In [None]:
flights <- flights %>%
  select(flight, sched_dep_time, dep_delay, sched_arr_time, carrier, origin, dest, distance, date, arrival) %>%

# Transform all character-type variables to factors
  mutate(date = as.Date(date), across(where(is.character), as.factor))

# Split the flights data into test and train sets
set.seed(246)
split <- flights %>% initial_split(prop = 3/4, strata = arrival)
test <- testing(split)
train <- training(split)

test %>% select(arrival) %>% table() %>% prop.table()
train %>% select(arrival) %>% table() %>% prop.table()

# Building a workflow
With your data ready for analysis, you will declare a logistic regression model with logistic_reg() to predict whether or not flights will arrive late.

You assign the role of "ID" to the flight variable to keep it as a reference for analysis and debugging. From the date variable, you will create new features to explicitly model the effect of holidays and represent factors as dummy variables.

Bundling your model and recipe() together using workflow() will help ensure that subsequent fittings or predictions implement consistent feature engineering steps.

Instructions
100 XP
Assign an "ID" role to flight.
Bundle the model and the recipe into a workflow object.
Fit lr_workflow to the train data.
Tidy the fitted workflow.

In [None]:
lr_model <- logistic_reg()

# Assign an "ID" role to flight
lr_recipe <- recipe(arrival ~ ., data = train) %>%
  update_role(flight, new_role = "ID") %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>%
  step_dummy(all_nominal_predictors())

# Bundle the model and the recipe into a workflow object
lr_workflow <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)
lr_workflow

# Fit lr_workflow to the train data
lr_fit <- lr_workflow %>% fit(data = train)

# Tidy the fitted workflow
tidy(lr_fit)

# Identifying missing values
Attrition is a critical issue for corporations: losing an employee implies not only the cost of recruiting and training a replacement, but also a loss of tacit knowledge and culture that is hard to recover.

The attrition dataset has information on employee attrition, including Age, WorkLifeBalance, DistanceFromHome, StockOptionLevel, and 27 other variables. Before continuing with your analysis, you want to detect any missing values.

The naniar package and the attrition dataset are already loaded for you.

In [None]:
# Explore missing data on the attrition dataset
vis_miss(attrition)

Instructions 2/2

Select the variables with missing values and visualize only those.

In [None]:
# Select the variables with missing values and rerun the analysis on those variables.
attrition %>%
  select("BusinessTravel", "DistanceFromHome",
         "StockOptionLevel", "WorkLifeBalance") %>%
  vis_miss()

# Imputing missing values and creating dummy variables
After detecting missing values in the attrition dataset and determining that they are missing completely at random (MCAR), you decide to use K Nearest Neighbors (KNN) imputation. While configuring your feature engineering recipe, you decide to create dummy variables for all your nominal variables and update the role of the ...1 variable to "ID" so you can keep it in the dataset for reference, without affecting your model.

Instructions
100 XP
Update the role of ...1 to "ID".
Impute values to all predictors where data are missing.
Create dummy variables for all nominal predictors.

In [None]:
lr_model <- logistic_reg()

lr_recipe <-
  recipe(Attrition ~., data = train) %>%

# Update the role of "...1" to "ID"
  update_role(...1, new_role = "ID" ) %>%

# Impute values to all predictors where data are missing
  step_impute_knn(all_predictors()) %>%

# Create dummy variables for all nominal predictors
  step_dummy(all_nominal_predictors())

lr_recipe

# Fitting and assessing the model
Now that you have addressed missing values and created dummy variables, it is time to assess your model's performance!

The attrition dataset, along with the test and train splits, the lr_recipe, and your declared logistic_reg() model, are all loaded for you.

Instructions
100 XP
Bundle model and recipe in workflow.
Fit workflow to the train data.
Generate an augmented data frame for performance assessment.

In [None]:
# Bundle model and recipe in workflow
lr_workflow <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)

# Fit workflow to the train data
lr_fit <- fit(lr_workflow, data = train)

# Generate an augmented data frame for performance assessment
lr_aug <- lr_fit %>% augment(test)

lr_aug %>% roc_curve(truth = Attrition, .pred_No) %>% autoplot()
bind_rows(lr_aug %>% roc_auc(truth = Attrition, .pred_No),
          lr_aug %>% accuracy(truth = Attrition, .pred_class))

# Predicting hotel bookings
You just got a job at a hospitality research company, and your first task is to build a model that predicts whether or not a hotel stay will include children. To train your model, you will rely on a modified version of the hotel booking demands dataset by Antonio, Almeida, and Nunes (2019). You are restricting your data to the following features:
```
features <- c('hotel', 'adults',
              'children', 'meal',
              'reserved_room_type',
              'customer_type',
              'arrival_date')
```
The data has been loaded for you as hotels, along with its corresponding test and train splits, and the model has been declared as lr_model <- logistic_reg().

You will assess model performance by accuracy and area under the ROC curve (AUC).

Instructions 1/2
50 XP
Generate "day of the week", "week" and "month" features.
Create dummy variables for all nominal predictors.

In [None]:
lr_recipe <-
  recipe(children ~., data = train) %>%
# Generate "day of the week", "week" and "month" features

  step_date(arrival_date, features = c("dow", "week", "month")) %>%

# Create dummy variables for all nominal predictors
  step_dummy(all_nominal_predictors())

Instructions 2/2
50 XP
Bundle your model and recipe in a workflow().
Fit the workflow to the training data.

In [None]:
lr_recipe <- recipe(children ~., data = train) %>%

# Generate "day of the week", "week" and "month" features
  step_date(arrival_date, features = c("dow", "week", "month")) %>%

# Create dummy variables for all nominal predictors
  step_dummy(all_nominal_predictors())

# Bundle your model and recipe in a workflow
lr_workflow <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)

# Fit the workflow to the training data
lr_fit <- lr_workflow %>% fit(data = train)
lr_aug <- lr_fit %>% augment(test)
bind_rows(
  roc_auc(lr_aug, truth = children, .pred_children),
  accuracy(lr_aug, truth = children, .pred_class)
)

# Normalizing and log-transforming
You are handed a dataset, attrition_num, with numerical data about employees who left the company. Features include Age, DistanceFromHome, and MonthlyRate.

You want to use this data to build a model that can predict if an employee is likely to stay, denoted by Attrition, a binary variable coded as a factor. In preparation for modeling, you want to reduce possible skewness and prevent some variables from outweighing others due to variations in scale.

The attrition_num data and the train and test splits are loaded for you.

Instructions
100 XP
Normalize all numeric predictors.
Log-transform all numeric features, with an offset of one.

In [None]:
lr_model <- logistic_reg()

lr_recipe <-
  recipe(Attrition~., data = train) %>%

# Normalize all numeric predictors
  step_normalize(all_numeric_predictors()) %>%

# Log-transform all numeric features, with an offset of one
  step_log(all_numeric_predictors(), offset = 1)

lr_workflow <-
  workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)

lr_workflow
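
As a rough base R sketch of what these two steps compute on a single numeric column: step_normalize() centers to mean 0 and scales to standard deviation 1, and step_log() with offset = 1 takes log(value + 1). The toy vector x is hypothetical and not taken from attrition_num. The sketch also shows that the order of the steps matters, because log(value + 1) has no real result once a normalized value falls below -1.

```r
# Toy column (hypothetical values, not taken from attrition_num)
x <- c(1, 2, 3, 4, 5)

# Normalizing: center to mean 0 and scale to standard deviation 1
z <- (x - mean(x)) / sd(x)

# Log-transforming with an offset of one: log(value + 1)
normalize_then_log <- suppressWarnings(log(z + 1))  # same order as the recipe above
log_only <- log(x + 1)                              # log of the raw, non-negative values

# Normalized values below -1 have no real logarithm and come out as NaN
print(z)
print(normalize_then_log)
print(log_only)
```

Whether any NaN values appear in practice depends on how far below the mean the smallest observations of each column lie, so it is worth checking the baked data before fitting.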

# Fit and augment
With your lr_workflow ready to go, you can fit it to the train data and then make predictions on the test data.

For model assessment, it is convenient to augment your fitted object by adding predictions and probabilities using augment().

Instructions
100 XP
Fit the workflow to the train data.
Augment the lr_fit object using the test data to get it ready for assessment.

In [None]:
# Fit the workflow to the train data
lr_fit <-
  fit(lr_workflow, data = train)

# Augment the lr_fit object
lr_aug <-
  augment(lr_fit, new_data = test)

lr_aug

# Customize your model assessment
Creating custom assessment functions is quite convenient when iterating through various models. The metric_set() function from the yardstick package can help you achieve this.

Define a function that returns roc_auc, accuracy, sens (sensitivity), and spec (specificity), and use it to assess your model.

The augmented data frame lr_aug is already loaded and ready to go.

Instructions
100 XP
Define a custom assessment function that returns roc_auc, accuracy, sens, and spec.
Assess your model using your new function on lr_aug to obtain the metrics you just chose.

In [None]:
# Define a custom assessment function
class_evaluate <- metric_set(roc_auc, accuracy, sens, spec)

# Assess your model using your new function
class_evaluate(lr_aug, truth = Attrition,
               estimate = .pred_class,
               .pred_No)
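
To make the class metrics concrete, here is a minimal base R sketch that computes accuracy, sensitivity, and specificity by hand from a confusion matrix. The truth and estimate vectors are hypothetical, not taken from lr_aug, and the first factor level ("No") is treated as the event of interest, which is assumed here to mirror yardstick's default. roc_auc is left out because it needs predicted probabilities rather than classes.

```r
# Hypothetical truth and predicted classes (not from lr_aug)
truth <- factor(c("No", "No", "No", "Yes", "Yes", "No", "Yes", "No"),
                levels = c("No", "Yes"))
estimate <- factor(c("No", "No", "Yes", "Yes", "No", "No", "Yes", "No"),
                   levels = c("No", "Yes"))

# Confusion matrix: rows are predictions, columns are the truth
cm <- table(estimate, truth)

# "No" (the first level) is treated as the event of interest
accuracy <- sum(diag(cm)) / sum(cm)                  # correct predictions / all
sensitivity <- cm["No", "No"] / sum(cm[, "No"])      # true events found / all events
specificity <- cm["Yes", "Yes"] / sum(cm[, "Yes"])   # true non-events found / all non-events

c(accuracy = accuracy, sens = sensitivity, spec = specificity)
```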

# Plain recipe
Using the attrition_num dataset with all numerical data about employees who have left the company, you want to build a model that can predict if an employee is likely to stay, using Attrition, a binary variable coded as a factor. To get started, you will define a plain recipe that does nothing other than define the model formula and the training data.

The attrition_num data, the logistic regression lr_model, the user-defined class_evaluate() function, and the train and test splits are loaded for you.

Instructions
100 XP
Create a plain recipe defining only the model formula.

In [None]:
# Create a plain recipe defining only the model formula
lr_recipe_plain <-
  recipe(Attrition ~., data = train)

lr_workflow_plain <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe_plain)
lr_fit_plain <- lr_workflow_plain %>%
  fit(train)
lr_aug_plain <- lr_fit_plain %>%
  augment(test)
lr_aug_plain %>%
  class_evaluate(truth = Attrition, estimate = .pred_class, .pred_No)

# Box-Cox transformation
Using the attrition_num dataset with all numerical data about employees who have left the company, you want to build a model that can predict if an employee is likely to stay, using Attrition, a binary variable coded as a factor. To get the features to behave nearly normally, you will create a recipe that implements the Box-Cox transformation.
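
For reference, the standard one-parameter Box-Cox transformation of a strictly positive value $x$ with parameter $\lambda$ is given below. In a recipe, step_BoxCox() estimates $\lambda$ separately for each selected column from the training data.

$$
x^{(\lambda)} =
\begin{cases}
\dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\
\log x, & \lambda = 0,
\end{cases}
\qquad x > 0 .
$$

Because the transformation is only defined for $x > 0$, columns containing zeros or negative values cannot be transformed this way.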

The attrition_num data, the logistic regression lr_model, the user-defined class_evaluate() function, and the train and test splits are loaded for you.

Instructions
100 XP
Create a recipe that uses Box-Cox to transform all numeric features, including the target.

In [None]:
# Create a recipe that uses Box-Cox to transform all numeric features
lr_recipe_BC <-
  recipe(Attrition ~., data = train) %>%
  step_BoxCox(all_numeric())

lr_workflow_BC <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe_BC)
lr_fit_BC <- lr_workflow_BC %>%
  fit(train)
lr_aug_BC <- lr_fit_BC %>%
  augment(test)
lr_aug_BC %>%
  class_evaluate(truth = Attrition, estimate = .pred_class, .pred_No)

# Yeo-Johnson transformation
Using the attrition_num dataset with all numerical data about employees who have left the company, you want to build a model that can predict if an employee is likely to stay, using Attrition, a binary variable coded as a factor. To get the features to behave nearly normally, you will create a recipe that implements the Yeo-Johnson transformation.
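
For reference, the Yeo-Johnson transformation extends the Box-Cox idea to zero and negative values. For a value $x$ and parameter $\lambda$, the standard definition is:

$$
\psi(x, \lambda) =
\begin{cases}
\dfrac{(x + 1)^{\lambda} - 1}{\lambda}, & x \ge 0,\ \lambda \neq 0, \\
\log(x + 1), & x \ge 0,\ \lambda = 0, \\
-\dfrac{(1 - x)^{2 - \lambda} - 1}{2 - \lambda}, & x < 0,\ \lambda \neq 2, \\
-\log(1 - x), & x < 0,\ \lambda = 2 .
\end{cases}
$$

For $\lambda = 1$ the transformation leaves $x$ unchanged, so values of $\lambda$ far from 1 indicate a stronger correction for skewness.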

The attrition_num data, the logistic regression lr_model, the user-defined class_evaluate() function, and the train and test splits are loaded for you.

Instructions
100 XP
Create a recipe that uses Yeo-Johnson to transform all numeric features, including the target.

In [None]:
# Create a recipe that uses Yeo-Johnson to transform all numeric features
lr_recipe_YJ <-
  recipe(Attrition ~., data = train) %>%
  step_YeoJohnson(all_numeric())

lr_workflow_YJ <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe_YJ)
lr_fit_YJ <- lr_workflow_YJ %>%
  fit(train)
lr_aug_YJ <- lr_fit_YJ %>%
  augment(test)
lr_aug_YJ %>%
  class_evaluate(truth = Attrition, estimate = .pred_class, .pred_No)

# Baseline
Continuing with the attrition_num dataset, you will create a baseline with a plain recipe to assess the effects of additional feature engineering steps. The attrition_num data, the logistic regression lr_model, the user-defined class_evaluate() function, and the train and test splits have already been loaded for you.

Instructions
100 XP
Bundle the model and recipe into a workflow.
Augment the fitted workflow to get it ready for assessment.

In [None]:
lr_recipe_plain <- recipe(Attrition ~., data = train)

# Bundle the model and recipe
lr_workflow_plain <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe_plain)
lr_fit_plain <- lr_workflow_plain %>%
  fit(train)

# Augment the fit workflow
lr_aug_plain <- lr_fit_plain %>%
  augment(test)
lr_aug_plain %>%
  class_evaluate(truth = Attrition,estimate = .pred_class,
                 .pred_No)

# step_poly()
Now that you have a baseline, you can compare your model's performance if you add a polynomial transformation to all numerical values.

The attrition_num data, the logistic regression lr_model, the user-defined class_evaluate() function, and the train and test splits have already been loaded for you.

Instructions
100 XP
Add a polynomial transformation to all numeric predictors.
Fit workflow to the train data.

In [None]:
lr_recipe_poly <-
  recipe(Attrition ~., data = train) %>%

# Add a polynomial transformation to all numeric predictors
  step_poly(all_numeric_predictors())

lr_workflow_poly <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe_poly)

# Fit workflow to the train data
lr_fit_poly <- lr_workflow_poly %>% fit(train)
lr_aug_poly <- lr_fit_poly %>% augment(test)
lr_aug_poly %>% class_evaluate(truth = Attrition, estimate = .pred_class, .pred_No)

# step_percentile()
How would applying a percentile transformation to your numeric variables affect model performance? Try it!

The attrition_num data, the logistic regression lr_model, the user-defined class_evaluate() function, and the train and test splits have already been loaded for you.

Instructions
100 XP
Apply a percentile transformation to all numeric predictors.

In [None]:
# Add percentile transformation to all numeric predictors
lr_recipe_perc <-
  recipe(Attrition ~., data = train) %>%
  step_percentile(all_numeric_predictors())
lr_workflow_perc <-
  workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe_perc)
lr_fit_perc <- lr_workflow_perc %>% fit(train)
lr_aug_perc <- lr_fit_perc %>% augment(test)
lr_aug_perc %>%
  class_evaluate(truth = Attrition, estimate = .pred_class, .pred_No)

# Who's staying?
It's time to practice applying several transformations to the attrition_num data. First, normalize or near-normalize numeric variables by applying a Yeo-Johnson transformation. Next, transform numeric predictors to percentiles, create dummy variables, and eliminate features with near-zero variance.

Instructions
100 XP
Apply a Yeo-Johnson transformation to all numeric variables.
Transform all numeric predictors into percentiles.
Create dummy variables for all nominal predictors.

In [None]:
lr_recipe <- recipe(Attrition ~., data = train) %>%

# Apply a Yeo-Johnson transformation to all numeric variables
  step_YeoJohnson(all_numeric()) %>%

# Transform all numeric predictors into percentiles
  step_percentile(all_numeric_predictors()) %>%

# Create dummy variables for all nominal predictors
  step_dummy(all_nominal_predictors())

lr_workflow <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)
lr_fit <- lr_workflow %>% fit(train)
lr_aug <- lr_fit %>% augment(test)
lr_aug %>% class_evaluate(truth = Attrition, estimate = .pred_class, .pred_No)

# Prepping the stage
You are going to explore the attrition_num dataset from the point of view of PCA to understand if it is feasible to reduce dimensionality while preserving most information. Start by creating a recipe that filters out near-zero variance features, normalizes the data, and implements PCA.

The attrition_num dataset is already loaded for you.

Instructions
100 XP
Remove possible near-zero variance features.
Normalize all numeric data.
Apply PCA.
Access the names of the output elements by preparing the recipe.

In [None]:
pc_recipe <- recipe(~., data = attrition_num) %>%

# Remove possible near-zero variance features
  step_nzv(all_numeric()) %>%

# Normalize all numeric data
  step_normalize(all_numeric()) %>%

# Apply PCA
  step_pca(all_numeric())

# Access the names of the output elements by preparing the recipe
pca_output <- prep(pc_recipe)
names(pca_output)

# Percent of variance explained
From pca_output, you can retrieve the standard deviation of each principal component. Then, use these values to compute the variance explained and the cumulative variance explained, and combine them into a tibble.

The pca_output object is loaded for you.

Instructions
100 XP
Calculate percentage of variance explained leveraging the standard deviation vector.
Create a tibble with principal components, variance explained and cumulative variance explained.

In [None]:
sdev <- pca_output$steps[[3]]$res$sdev
# Calculate percentage of variance explained
var_explained <- sdev^2 / sum(sdev^2)

# Create a tibble with principal components, variance explained and cumulative variance explained
PCA <- tibble(PC = seq_along(sdev), var_explained = var_explained,
              cumulative = cumsum(var_explained))
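
The same calculation can be tried without the recipe. The sketch below runs prcomp() directly on USArrests, a data set that ships with base R and is used here only as a stand-in for attrition_num. It also uses a plain data frame instead of a tibble to stay dependency-free.

```r
# USArrests ships with base R; it stands in for attrition_num here
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Standard deviation of each principal component
sdev <- pca$sdev

# Variance is the squared standard deviation; divide by the total to get shares
var_explained <- sdev^2 / sum(sdev^2)

# One row per component, with the running total of variance explained
pca_summary <- data.frame(
  PC = seq_along(sdev),
  var_explained = var_explained,
  cumulative = cumsum(var_explained)
)
pca_summary
```

The cumulative column is what you would read to decide how many components to keep for a chosen share of the total variance.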

# Visualizing variance explained
With all the calculations under your belt, it always comes in handy to represent your data visually. You will now create a column plot showing the variance explained by each principal component.

The var_explained vector you created in the last exercise is available, and the ggplot2 theme is set to theme_classic().

Instructions
100 XP
Use the information in the PCA tibble to create a column plot of variance explained.

In [None]:
PCA <- tibble(PC = seq_along(sdev), var_explained = var_explained,
              cumulative = cumsum(var_explained))

# Use the information in the PCA tibble to create a column plot of variance explained
PCA %>% ggplot(aes(x = PC, y = var_explained)) +
  geom_col(fill = "steelblue") +
  xlab("Principal components") +
  ylab("Variance explained")

# Investigating education field
Education field is a feature of the attrition dataset. You are interested in using it as a predictor for churning. Start by taking a look at the factor values. The dataset is already loaded.

Instructions
100 XP
Select EducationField from the attrition dataset.
Print a frequency table of the factor values in EducationField.

In [None]:
attrition %>%
# Select education field
  select(EducationField) %>%

# Print a frequency table of factor values
  table()

# Into the matrix
You identified six distinct values for EducationField. But you suspect that others might show up as you run the model on new data. To prepare for this, you will create a hash index with 50 terms. The textrecipes package, the attrition_train, and the attrition_test splits are already loaded.

Instructions
100 XP
Add a step to the recipe that generates a dummy_hash index for EducationField.
Prepare the recipe.
Bake the prepared recipe.
Bind the baked recipe table and the EducationField values into one table and print the first 7 rows, as well as columns 1 and 18 to 20.

In [None]:
recipe <- recipe(~EducationField, data = attrition_train) %>%
# Add a step to the recipe that generates a dummy_hash index for EducationField
  step_dummy_hash(EducationField, prefix = NULL, signed = FALSE, num_terms = 50L)

# Prepare the recipe
object <- recipe %>%
  prep()

# Bake the prepped recipe
baked <- bake(object, new_data = attrition_test)

# Bind the baked recipe table and the EducationField values into one table
bind_cols(attrition_test$EducationField, baked)[1:7,c(1,18:20)]
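
The idea behind feature hashing can be sketched in base R: a hash function maps each factor value to one of a fixed number of columns, so a value never seen in training still lands in some column without refitting anything. The toy_hash() function below is a hypothetical stand-in written for illustration; it is not the hash function that textrecipes uses, and the field names are made up.

```r
# Toy hash: sum of character codes modulo the number of columns, shifted to 1..num_terms.
# This is an assumption for illustration; textrecipes uses a different hash function.
toy_hash <- function(value, num_terms) {
  sum(utf8ToInt(value)) %% num_terms + 1
}

# Hypothetical factor values, including one that was never seen before
fields <- c("Medical", "Marketing", "Other", "Astrophysics")
num_terms <- 50L

# One row per value, one column per hash bucket; a 1 marks the bucket the value falls in
hashed <- matrix(0L, nrow = length(fields), ncol = num_terms,
                 dimnames = list(fields, NULL))
for (i in seq_along(fields)) {
  hashed[i, toy_hash(fields[i], num_terms)] <- 1L
}

# Each value occupies exactly one of the 50 columns
rowSums(hashed)
```

Two different values can fall into the same column (a collision). Using more terms than distinct values, as with the 50 terms above, makes collisions less likely.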

# Visualizing the hashing
It is often helpful to look at a visual representation of your data. The baked tibble is loaded with the hash indexes representing the EducationField factor. You can explore all or a portion of this dataset as a matrix to identify patterns or detect potential errors.

The plot.matrix package has already been loaded for you.

Instructions
100 XP
Convert the baked tibble to a matrix.
Plot the attrition_hash matrix.

In [None]:
# Convert the baked tibble to a matrix
attrition_hash <- as.matrix(baked)[1:50,]

# Plot the attrition_hash matrix
plot(attrition_hash,
     col = c("white","steelblue"),
     key = NULL,
     border = NA)

# Setting up your workflow
You want to investigate if JobRole alone can be a predictor for Attrition. Given that JobRole is a factor, you plan to use a Bayes encoder to represent it numerically in your model.

The embed package and the corresponding test and train splits from the attrition dataset are loaded in your workspace.

Instructions
100 XP
Create recipe using the Bayes encoder.
Bundle the model and recipe with a workflow.

In [None]:
lr_model <- logistic_reg()
# Create recipe using the Bayes encoder
lr_recipe_glm <-
  recipe(Attrition ~ JobRole, data = train) %>%
  step_lencode_bayes(JobRole, outcome = vars(Attrition))

# Bundle with workflow
lr_workflow_glm <-
  workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe_glm)

lr_workflow_glm


# Fitting, augmenting, and assessing
You are ready to fit the workflow to the training data and assess its performance using the testing split.

The embed package, the lr_workflow_glm object, and the corresponding test and train splits are in your workspace.

Instructions
100 XP
Fit the workflow to the training set.
Augment the fit using the test split.
Assess the model.



In [None]:
# Fit the workflow to the training set
lr_fit_glm <- lr_workflow_glm %>%
  fit(train)

# Augment the fit using the test split
lr_aug_glm <- lr_fit_glm %>%
  augment(test)

# Assess the model
glm_metrics <- lr_aug_glm %>%
  class_evaluate(truth = Attrition,
                 estimate = .pred_class,
                 .pred_Yes)

glm_metrics

# Binding models together
When running several models, it is useful to summarize your results for comparison, both as a tibble and as a parallel coordinates chart. The glm_metrics object that you created is loaded in the workspace, along with the corresponding bayes_metrics and mixed_metrics. The GGally package is ready for you.

Instructions
100 XP
Bind all three models by rows.
Create a parallel coordinates plot to compare the models' performance in both metrics.

In [None]:
model <- c("glm", "glm","bayes","bayes", "mixed", "mixed")

# Bind models by rows
models <- bind_rows(glm_metrics, bayes_metrics, mixed_metrics) %>%
  add_column(model = model) %>%
  select(-.estimator) %>%
  spread(model, .estimate)

models

# Create a parallel coordinates plot
ggparcoord(models,
           columns = 2:4, groupColumn = 1,
           scale="globalminmax",
           showPoints = TRUE)

# Create a workflow
As you keep investigating attrition, it is natural to build a model that takes all the available predictors, hoping to get a highly accurate prediction. Let's see how it goes.

Instructions
100 XP
Create a recipe to predict Attrition based on all features.
Bundle the model and recipe in a workflow.

In [None]:
lr_model <- logistic_reg()

# Create a recipe to predict Attrition based on all features
lr_recipe <-
  recipe(Attrition ~.,
         data = train)

# Bundle the model and recipe in a workflow
lr_workflow <-
  workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)

lr_workflow

# Fit and augment
Now it is time to fit and augment your model. You are using all available variables and have some expectations about the results. Can this relatively larger model outperform simpler ones? Take a look at the lr_aug object to see how your model is performing.

Instructions
100 XP
Augment the fit object to assess the model.

In [None]:
lr_fit <- lr_workflow %>%
  fit(test)

# Augment the fit object to assess the model
lr_aug <- lr_fit %>%
  augment(test)

lr_aug %>% class_evaluate(truth = Attrition,
                          estimate = .pred_class,
                          .pred_No)
lr_aug

# Which is the main predictor?
You've got a remarkable prediction, but what were the main predictors? How can you make sense of the model so that you can go beyond the raw results? Machine learning models are often criticized for their lack of interpretability. However, variable importance rankings shed some light on the relevance of your chosen features to the outcome. So let's investigate variable importance and go from there.

Instructions
100 XP
Create a variable importance chart.

In [None]:
lr_fit <- lr_workflow %>%
  fit(test)

lr_aug <- lr_fit %>%
  augment(test)

lr_aug %>% class_evaluate(truth = Attrition,
                          estimate = .pred_class,
                          .pred_No)

# Create a variable importance chart
lr_fit %>%
  extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "steelblue"), num_features = 5)

# Sifting through variable importance
The attrition dataset contains 839 observations and 30 predictors for "Attrition." You are interested in exploring the trade-off between the performance of a model that uses all available predictors versus a reduced model based on a few informative variables.

In this exercise, you'll fit a model and have a look at the variable importance of this fitted model. In the following exercise, you'll assess model performance using this model compared to using a reduced model.

The train and test splits and the vip package are available in your environment, along with a predeclared logistic regression model.

Instructions
100 XP
Create a recipe that models Attrition using all predictors.
Fit the workflow to the training data.
Use the fit_full object to graph the variable importance of your model.
Apply the extract_fit_parsnip() function before vip() to feed it the required information.

In [None]:
# Create a recipe that models Attrition using all the predictors
recipe_full <- recipe(Attrition ~ ., data = train)

workflow_full <- workflow() %>%
  add_model(model) %>%
  add_recipe(recipe_full)

# Fit the workflow to the training data
fit_full <- workflow_full %>%
  fit(data = train)

# Use the fit_full object to graph the variable importance of your model. Apply extract_fit_parsnip() function before vip()
fit_full %>% extract_fit_parsnip() %>%
  vip(aesthetics = list(fill = "steelblue"))
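
For intuition about what the chart ranks, here is a minimal base R sketch. It assumes the importance score for a glm-based logistic regression is the absolute value of each coefficient's test statistic, which is how the vip documentation describes its model-specific scores for linear and generalized linear models. The mtcars data and the am outcome are stand-ins; the attrition data is not used.

```r
# mtcars ships with base R; am (0/1) stands in for the binary Attrition outcome
fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Coefficient table: estimate, standard error, z statistic, p-value
coefs <- summary(fit)$coefficients

# Importance score: absolute z statistic of each predictor (intercept dropped),
# sorted from most to least important
importance <- sort(abs(coefs[-1, "z value"]), decreasing = TRUE)
importance
```

Because the score depends on the standard error as well as the estimate, a predictor with a large coefficient can still rank low if it is estimated imprecisely.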

# Assessing model performance using all available predictors
In order to assess the performance of your reduced model, it is important to set a benchmark. Measure your full model's performance to understand the trade-off of a reduced model. Recall the variable importance chart that you created in an earlier exercise.

Your fitted model has been saved as fit_full. The train and test splits, together with your user-defined function class_evaluate(), are loaded in your environment.

Instructions
100 XP
Create an augmented object from the fitted full model.
Assess model performance using class_evaluate.

In [None]:
# Create an augmented object from the fitted full model
aug_full <-
  fit_full %>%
  augment(test)

# Assess model performance using class_evaluate
aug_full %>% class_evaluate(truth = Attrition,
               estimate = .pred_class,
               .pred_Yes)

# Building a reduced model
Variable importance analysis helped you identify the most predictive features from the attrition dataset. Based on it, you will build a drastically reduced model with only three variables (OverTime, DistanceFromHome, and NumCompaniesWorked) and compare its performance to the full model baseline. The augmented predictions for the full model are stored in aug_full.

All data, along with the train and test splits, is available in your environment.

Instructions 1/3
35 XP
Create a recipe using the formula syntax that includes only OverTime, DistanceFromHome, and NumCompaniesWorked as predictors.
Bundle the recipe with your model.

In [None]:
# Create a recipe using the formula syntax that includes only OverTime, DistanceFromHome and NumCompaniesWorked as predictors
recipe_reduced <-
  recipe(Attrition ~ OverTime + DistanceFromHome + NumCompaniesWorked, data = train)

# Bundle the recipe with your model
workflow_reduced <-
  workflow() %>%
  add_model(model) %>%
  add_recipe(recipe_reduced)

Instructions 2/3
35 XP
Augment the fitted workflow for analysis using the test data.

In [None]:
# Create a recipe using the formula syntax that includes only OverTime, DistanceFromHome and NumCompaniesWorked as predictors
recipe_reduced <-
  recipe(Attrition ~ OverTime + DistanceFromHome + NumCompaniesWorked, data = train)

# Bundle the recipe with your model
workflow_reduced <-
  workflow() %>%
  add_model(model) %>%
  add_recipe(recipe_reduced)

fit_reduced <-
  workflow_reduced %>%
  fit(data = train)

# Augment the fitted workflow for analysis using the test data
aug_reduced <-
  fit_reduced %>%
  augment(test)

Instructions 3/3
0 XP
Evaluate your reduced model for comparison.

In [None]:
# Create a recipe using the formula syntax that includes only OverTime, DistanceFromHome and NumCompaniesWorked as predictors
recipe_reduced <-
  recipe(Attrition ~ OverTime + DistanceFromHome + NumCompaniesWorked, data = train)

# Bundle the recipe with your model
workflow_reduced <-
  workflow() %>%
  add_model(model) %>%
  add_recipe(recipe_reduced)

fit_reduced <-
  workflow_reduced %>%
  fit(data = train)

# Augment the fitted workflow for analysis using the test data
aug_reduced <-
  fit_reduced %>%
  augment(test)

full_model <- aug_full %>% class_evaluate(truth = Attrition,
                            estimate = .pred_class, .pred_Yes)

# Evaluate your reduced model for comparison
reduced_model <- aug_reduced %>% class_evaluate(truth = Attrition,
                                             estimate = .pred_class, .pred_Yes)

bind_cols(full_model,reduced_model) %>%
  select(1,3,6) %>%
  rename(metric = .metric...1, full_model = .estimate...3,
         reduced_model = .estimate...6)

# Manual regularization with Lasso
The attrition dataset has 30 variables. Your Human Resources department asks you to build a model that is easy to interpret and maintain. They specifically want to reduce the number of features so that your model is as interpretable as possible.

In this exercise, you'll use Lasso to reduce the number of variables in your model automatically. In this first attempt, you will manually input a penalty and observe the model's behavior.

The train and test data and a basic recipe are already loaded for you.
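
For reference, the glmnet engine fits logistic regression by minimizing a penalized negative log-likelihood. With outcomes $y_i \in \{0, 1\}$, predictors $x_i$, and $N$ observations, the objective as documented for glmnet is:

$$
\min_{\beta_0,\, \beta} \; -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i (\beta_0 + x_i^{\top} \beta) - \log\big(1 + e^{\beta_0 + x_i^{\top} \beta}\big) \Big]
+ \lambda \Big[ \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \Big]
$$

In parsnip terms, penalty corresponds to $\lambda$ and mixture to $\alpha$. Setting mixture = 1 removes the ridge term and leaves the pure Lasso penalty $\lambda \lVert \beta \rVert_1$, which can shrink coefficients exactly to zero and thereby drop variables from the model.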

Instructions
100 XP
Set your logistic regression model to use the glmnet engine.
Set arguments to run Lasso with a penalty of 0.06.

In [None]:
model_lasso_manual <- logistic_reg() %>%

# Set the glmnet engine for your logistic regression model
  set_engine("glmnet") %>%

# Set arguments to run Lasso with a penalty of 0.06
  set_args(mixture = 1, penalty = 0.06)

workflow_lasso_manual <- workflow() %>%
  add_model(model_lasso_manual) %>%
  add_recipe(recipe)

fit_lasso_manual <- workflow_lasso_manual %>%
  fit(train)

tidy(fit_lasso_manual)