
# --- Module Start ---



# A tentative model
You are handed a dataset with measurements of the gravitational force between two bodies at different distances and are challenged to build a simple model that predicts the force at a given distance. Initially, you want to stick to simple linear regression. The data consist of 120 pairs of distance and force and are loaded for you as newton.

Instructions
Build a linear model for the newton data using base R's linear model function, lm(), and assign it to lr_force.
Create a new data frame df by binding the prediction values to the original newton data.
Generate a scatterplot of force versus distance using ggplot().
Add a regression line to the scatterplot with the fitted values.

In [None]:
# Build a linear model for the newton data and assign it to lr_force
lr_force <- lm(force ~ distance, data = newton)

# Create a new data frame by binding the prediction values to the original data
df <- newton %>% bind_cols(lr_pred = predict(lr_force))

# Generate a scatterplot of force vs. distance
df %>%
  ggplot(aes(x = distance, y = force)) +
  geom_point() +
# Add a regression line with the fitted values
  geom_line(aes(y = lr_pred), color = "blue", lwd = .75) +
  ggtitle("Linear regression of force vs. distance") +
  theme_classic()


# Manually engineering a feature
After doing some research with your team, you recall that the gravitational force of attraction between two bodies obeys Newton's formula:


$$
F = G \frac{m_1 m_2}{r^2}
$$

You can't use the formula directly because the masses are unknown, but you can fit a regression model of force as a function of inv_square_distance. The augmented dataset df you built in the previous exercise has been loaded for you.

Instructions
Create a new variable inv_square_distance, defined as the reciprocal of the squared distance, and add it to the df data frame, saving the result as df_inverse.
Build a simple regression model using lm() of force versus inv_square_distance and save it as lr_force_2.
Bind your predictions to df_inverse.

In [None]:
# Create a new variable inv_square_distance
df_inverse <- df %>% mutate(inv_square_distance = 1/distance^2)

# Build a simple regression model
lr_force_2 <- lm(force ~ inv_square_distance, data = df_inverse)

# Bind your predictions to df_inverse
df_inverse <- df_inverse %>% bind_cols(lr2_pred = predict(lr_force_2))

df_inverse %>% ggplot(aes(x = distance, y = force)) +
  geom_point() +
  geom_line(aes(y = lr2_pred), col = "blue", lwd = .75) +
  ggtitle("Linear regression of force vs. inv_square_distance") +
  theme_classic()
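
To see why the engineered feature matters, the following sketch simulates inverse-square data and compares the two fits. The data frame `sim`, the constant 50 (standing in for G m_1 m_2), and the noise level are assumptions made for illustration; they are not the course's newton data.

```r
# Simulated stand-in for newton (assumption: force = 50 / distance^2 + noise)
set.seed(42)
sim <- data.frame(distance = runif(120, min = 1, max = 10))
sim$force <- 50 / sim$distance^2 + rnorm(120, sd = 0.1)

# Fit one model on the raw distance and one on the engineered feature
sim$inv_square_distance <- 1 / sim$distance^2
fit_raw <- lm(force ~ distance, data = sim)
fit_inv <- lm(force ~ inv_square_distance, data = sim)

# Compare the share of variance explained by each model
r2_raw <- summary(fit_raw)$r.squared
r2_inv <- summary(fit_inv)$r.squared
print(c(raw = r2_raw, engineered = r2_inv))

# The slope on the engineered feature estimates the simulated constant
slope_inv <- unname(coef(fit_inv)["inv_square_distance"])
print(slope_inv)
```

Because the simulated force is linear in 1/distance^2, the second model should explain more of the variance and its slope should land close to the constant used in the simulation.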

# Setting up your data for analysis
You will look at a version of the nycflights13 dataset, loaded as flights. It contains information on flights departing from New York City. You are interested in predicting whether or not they will arrive late to their destination, but first, you need to set up the data for analysis.

After discussing your model goals with a team of experts, you selected the following variables for your model: flight, sched_dep_time, dep_delay, sched_arr_time, carrier, origin, dest, distance, date, arrival.

You will also mutate() the date using as.Date() and convert character type variables to factors.

Lastly, you will split the data into train and test datasets.

Instructions
Transform all character-type variables to factors.
Split the flights data into test and train sets.

In [None]:
flights <- flights %>%
  select(flight, sched_dep_time, dep_delay, sched_arr_time, carrier, origin, dest, distance, date, arrival) %>%

# Transform all character-type variables to factors
  mutate(date = as.Date(date), across(where(is.character), as.factor))

# Split the flights data into test and train sets
set.seed(246)
split <- flights %>% initial_split(prop = 3/4, strata = arrival)
test <- testing(split)
train <- training(split)

test %>% select(arrival) %>% table() %>% prop.table()
train %>% select(arrival) %>% table() %>% prop.table()
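
The effect of the strata argument can be checked on a small simulated dataset. This is a minimal sketch assuming the rsample package is installed; `toy_flights`, its columns, and the 30% share of late flights are hypothetical and only stand in for the course's flights data.

```r
library(rsample)

# Hypothetical stand-in for flights: about 30% of flights arrive late
set.seed(246)
toy_flights <- data.frame(
  dep_delay = rnorm(1000, mean = 10, sd = 30),
  arrival = factor(sample(c("late", "on_time"), 1000,
                          replace = TRUE, prob = c(0.3, 0.7)))
)

# Stratified 75/25 split on the outcome
toy_split <- initial_split(toy_flights, prop = 3/4, strata = arrival)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of late arrivals in each partition
late_train <- mean(toy_train$arrival == "late")
late_test <- mean(toy_test$arrival == "late")
print(c(train = late_train, test = late_test))
```

Stratifying on the outcome samples within each class, so both partitions should show nearly the same share of late arrivals.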

# Building a workflow
With your data ready for analysis, you will declare a logistic regression model with logistic_reg() to predict whether or not a flight will arrive late.

You assign the role of "ID" to the flight variable to keep it as a reference for analysis and debugging. From the date variable, you will create new features to explicitly model the effect of holidays and represent factors as dummy variables.

Bundling your model and recipe() together using workflow() will help ensure that subsequent fittings or predictions implement consistent feature engineering steps.

Instructions
Assign an "ID" role to flight.
Bundle the model and the recipe into a workflow object.
Fit lr_workflow to the test data.
Tidy the fitted workflow.

In [None]:
lr_model <- logistic_reg()

# Assign an "ID" role to flight
lr_recipe <- recipe(arrival ~., data = train) %>% update_role(flight, new_role = "ID") %>%
  step_holiday(date, holidays = timeDate::listHolidays("US")) %>% step_dummy(all_nominal_predictors())

# Bundle the model and the recipe into a workflow object
lr_workflow <- workflow() %>% add_model(lr_model) %>% add_recipe(lr_recipe)
lr_workflow

# Fit lr_workflow to the test data
lr_fit <- lr_workflow %>% fit(data = test)

# Tidy the fitted workflow
tidy(lr_fit)
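
To inspect what the recipe steps produce before any model is fitted, a recipe can be prepped and baked on its own. The sketch below assumes the recipes and timeDate packages are installed; the four-row `toy` data frame and the three chosen holidays are hypothetical.

```r
library(recipes)

# Hypothetical miniature of the flights data
toy <- data.frame(
  flight = 1:4,
  date = as.Date(c("2013-01-01", "2013-07-04", "2013-12-25", "2013-03-15")),
  carrier = factor(c("AA", "UA", "AA", "DL")),
  arrival = factor(c("late", "on_time", "late", "on_time"))
)

toy_recipe <- recipe(arrival ~ ., data = toy) %>%
  # Keep flight for reference without using it as a predictor
  update_role(flight, new_role = "ID") %>%
  # One 0/1 indicator column per listed holiday
  step_holiday(date, holidays = c("USNewYearsDay", "USIndependenceDay",
                                  "USChristmasDay")) %>%
  # One indicator column per non-reference factor level
  step_dummy(all_nominal_predictors())

# Estimate the steps on toy and return the processed data
baked <- bake(prep(toy_recipe), new_data = NULL)
print(names(baked))
```

The baked data contains indicator columns such as date_USChristmasDay and carrier_UA, which are the features the logistic regression actually sees.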

# Identifying missing values
Attrition is a critical issue for corporations, as losing an employee implies not only the cost of recruiting and training a new one, but constitutes a loss in tacit knowledge and culture that is hard to recover.

The attrition dataset has information on employee attrition, including Age, WorkLifeBalance, DistanceFromHome, StockOptionLevel, and 27 other variables. Before continuing with your analysis, you want to detect any missing values.

The naniar package and the attrition dataset are already loaded for you.

In [None]:
# Explore missing data on the attrition dataset
vis_miss(attrition)

Instructions 2/2

Select the variables with missing values and visualize only those.

In [None]:
# Select the variables with missing values and rerun the analysis on those variables.
attrition %>%
  select("BusinessTravel", "DistanceFromHome",
         "StockOptionLevel", "WorkLifeBalance") %>%
  vis_miss()
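
A numeric summary can complement the plot when choosing which columns to pass to select(). The following base R sketch uses a hypothetical five-row `toy_attrition` data frame, not the course data.

```r
# Hypothetical miniature of the attrition data with a few gaps
toy_attrition <- data.frame(
  Age = c(41, 49, 37, 33, 27),
  DistanceFromHome = c(1, NA, 2, 3, NA),
  WorkLifeBalance = c(1, 3, NA, 3, 3),
  BusinessTravel = c("Rarely", "Frequently", NA, "Rarely", "Rarely")
)

# Number and percentage of missing values per column
n_missing <- colSums(is.na(toy_attrition))
pct_missing <- round(100 * n_missing / nrow(toy_attrition), 1)
print(data.frame(n_missing, pct_missing))

# Names of the columns that contain at least one missing value
cols_with_na <- names(n_missing)[n_missing > 0]
print(cols_with_na)
```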

# Imputing missing values and creating dummy variables
After detecting missing values in the attrition dataset and determining that they are missing completely at random (MCAR), you decide to use K-nearest neighbors (KNN) imputation. While configuring your feature engineering recipe, you also create dummy variables for all nominal predictors and update the role of the ...1 variable to "ID", so it stays in the dataset for reference without affecting your model.

Instructions
Update the role of ...1 to "ID".
Impute values to all predictors where data are missing.
Create dummy variables for all nominal predictors.

In [None]:
lr_model <- logistic_reg()

lr_recipe <-
  recipe(Attrition ~., data = train) %>%

# Update the role of "...1" to "ID"
  update_role(...1, new_role = "ID" ) %>%

# Impute values to all predictors where data are missing
  step_impute_knn(all_predictors()) %>%

# Create dummy variables for all nominal predictors
  step_dummy(all_nominal_predictors())

lr_recipe
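
Printing a recipe only lists its steps; prepping and baking it shows the imputed result. This is a minimal sketch assuming the recipes package is installed. The `toy` data frame, the positions of the missing values, and neighbors = 5 are assumptions for illustration.

```r
library(recipes)

# Hypothetical data with gaps in one numeric and one nominal predictor
set.seed(1)
toy <- data.frame(
  Age = round(runif(60, 20, 60)),
  DistanceFromHome = round(runif(60, 1, 30)),
  BusinessTravel = factor(sample(c("Rarely", "Frequently", "Never"), 60,
                                 replace = TRUE)),
  Attrition = factor(sample(c("No", "Yes"), 60, replace = TRUE))
)
toy$DistanceFromHome[c(3, 17, 42)] <- NA
toy$BusinessTravel[c(5, 30)] <- NA

toy_recipe <- recipe(Attrition ~ ., data = toy) %>%
  # Fill each gap from the 5 most similar rows
  step_impute_knn(all_predictors(), neighbors = 5) %>%
  # Encode factors only after their gaps are filled
  step_dummy(all_nominal_predictors())

# Estimate the steps on toy and return the processed data
baked <- bake(prep(toy_recipe), new_data = NULL)
print(colSums(is.na(baked)))
```

Placing the imputation step before step_dummy() means the dummy columns are built from complete factors rather than from values that are still missing.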

# Fitting and assessing the model
Now that you have addressed missing values and created dummy variables, it is time to assess your model's performance!

The attrition dataset, along with the test and train splits, the lr_recipe, and your declared logistic_reg() model, are all loaded for you.

Instructions
Bundle model and recipe in workflow.
Fit workflow to the train data.
Generate an augmented data frame for performance assessment.

In [None]:
# Bundle model and recipe in workflow
lr_workflow <- workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)

# Fit workflow to the train data
lr_fit <- fit(lr_workflow, data = train)

# Generate an augmented data frame for performance assessment
lr_aug <- lr_fit %>% augment(test)

lr_aug %>% roc_curve(truth = Attrition, .pred_No) %>% autoplot()
bind_rows(lr_aug %>% roc_auc(truth = Attrition, .pred_No),
          lr_aug %>% accuracy(truth = Attrition, .pred_class))
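
The two metrics above can be reproduced by hand on a tiny example, which also shows why the probability column .pred_No is passed: yardstick treats the first factor level ("No" here) as the event by default. The five observations below are hypothetical, and the 0.5 cutoff is an assumption of this sketch.

```r
# Hypothetical truth and predicted probabilities of the first level, "No"
truth <- factor(c("No", "No", "No", "Yes", "Yes"), levels = c("No", "Yes"))
pred_no <- c(0.9, 0.8, 0.4, 0.3, 0.6)

# Hard class predictions at a 0.5 cutoff
pred_class <- factor(ifelse(pred_no >= 0.5, "No", "Yes"),
                     levels = levels(truth))

# Accuracy: share of correct class predictions
accuracy_manual <- mean(pred_class == truth)

# AUC: probability that a random event ("No") case receives a higher
# score than a random non-event ("Yes") case, counting ties as one half
event <- pred_no[truth == "No"]
non_event <- pred_no[truth == "Yes"]
auc_manual <- mean(outer(event, non_event, ">") +
                   0.5 * outer(event, non_event, "=="))

print(c(accuracy = accuracy_manual, auc = auc_manual))
```

Here three of five predictions are correct (accuracy 0.6), and five of the six event/non-event pairs are ranked correctly (AUC 5/6).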

# Predicting hotel bookings
You just got a job at a hospitality research company, and your first task is to build a model that predicts whether or not a hotel stay will include children. To train your model, you will rely on a modified version of the hotel booking demands dataset by Antonio, Almeida, and Nunes (2019). You are restricting your data to the following features:
```
features <- c('hotel', 'adults',
              'children', 'meal',
              'reserved_room_type',
              'customer_type',
              'arrival_date')
```
The data has been loaded for you as hotels, along with its corresponding test and train splits, and the model has been declared as lr_model <- logistic_reg().

You will assess model performance by accuracy and area under the ROC curve (AUC).

Instructions 1/2
Generate "day of the week", "week" and "month" features.
Create dummy variables for all nominal predictors.

In [None]:
lr_recipe <-
  recipe(children ~., data = train) %>%
# Generate "day of the week", "week" and "month" features

  step_date(arrival_date, features = c("dow", "week", "month")) %>%

# Create dummy variables for all nominal predictors
  step_dummy(all_nominal_predictors())
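
To see which columns step_date() derives from a date, the step can be prepped and baked on a few rows. This sketch assumes the recipes package is installed; the three-row `toy` data frame is hypothetical.

```r
library(recipes)

# Hypothetical miniature of the hotels data
toy <- data.frame(
  arrival_date = as.Date(c("2016-07-01", "2016-12-24", "2017-02-14")),
  children = factor(c("children", "none", "none"))
)

date_recipe <- recipe(children ~ ., data = toy) %>%
  # Derive day-of-week, week-of-year and month features from the date
  step_date(arrival_date, features = c("dow", "week", "month"))

# Estimate the step on toy and return the processed data
baked <- bake(prep(date_recipe), new_data = NULL)
print(baked)
```

The new columns are named after the source variable (arrival_date_dow, arrival_date_week, arrival_date_month). The day-of-week and month columns are factors, which is why step_dummy() follows in the recipe above.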

Instructions 2/2
Bundle your model and recipe in a workflow().
Fit the workflow to the training data.

In [None]:
lr_recipe <- recipe(children ~., data = train) %>%

# Generate "day of the week", "week" and "month" features
  step_date(arrival_date, features = c("dow", "week", "month")) %>%

# Create dummy variables for all nominal predictors
  step_dummy(all_nominal_predictors())

# Bundle your model and recipe in a workflow
lr_workflow <- workflow() %>% add_model(lr_model) %>% add_recipe(lr_recipe)

# Fit the workflow to the training data
lr_fit <- lr_workflow %>% fit(data = train)

# Augment the test data with predictions and report AUC and accuracy
lr_aug <- lr_fit %>% augment(test)
bind_rows(
  roc_auc(lr_aug, truth = children, .pred_children),
  accuracy(lr_aug, truth = children, .pred_class)
)

# Normalizing and log-transforming
You are handed a dataset, attrition_num, with numerical data about employees who left the company. Features include Age, DistanceFromHome, and MonthlyRate.

You want to use this data to build a model that can predict if an employee is likely to stay, denoted by Attrition, a binary variable coded as a factor. In preparation for modeling, you want to reduce possible skewness and prevent some variables from outweighing others due to variations in scale.

The attrition_num data and the train and test splits are loaded for you.

Instructions
Normalize all numeric predictors.
Log-transform all numeric features, with an offset of one.


In [None]:
lr_model <- logistic_reg()

lr_recipe <-
  recipe(Attrition~., data = train) %>%

# Normalize all numeric predictors
  step_normalize(all_numeric_predictors()) %>%

# Log-transform all numeric features, with an offset of one
  step_log(all_numeric_predictors(), offset = 1)

lr_workflow <-
  workflow() %>%
  add_model(lr_model) %>%
  add_recipe(lr_recipe)

lr_workflow
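
What the two transformations do to a single skewed feature can be checked in base R. This is a minimal sketch on a hypothetical, simulated `monthly_rate` variable. It applies the log before normalizing, because log(x + 1) is only defined for x > -1 and a centered variable can fall below that.

```r
# Hypothetical right-skewed feature, e.g. a monthly rate
set.seed(7)
monthly_rate <- rlnorm(500, meanlog = 8, sdlog = 1)

# Sample skewness: third standardized moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# log(x + 1) compresses the long right tail
logged <- log(monthly_rate + 1)

# Centering and scaling puts the feature on a mean-0, sd-1 scale
normalized <- (logged - mean(logged)) / sd(logged)

print(c(raw = skewness(monthly_rate), logged = skewness(logged)))
print(c(mean = mean(normalized), sd = sd(normalized)))
```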

# Fit and augment
With your lr_workflow ready to go, you can fit it to the train data and then make predictions on the test data.

For model assessment, it is convenient to augment your fitted object by adding predictions and probabilities using augment().

Instructions
Fit the workflow to the train data.
Augment the lr_fit object using the test data to get it ready for assessment.

In [None]:
# Fit the workflow to the train data
lr_fit <-
  fit(lr_workflow, data = train)

# Augment the lr_fit object
lr_aug <-
  augment(lr_fit, new_data = test)

lr_aug
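
The fit-then-augment pattern can be run end to end on simulated data. This sketch assumes the tidymodels meta-package is installed; the `toy` data frame, its coefficients, and the single normalization step are hypothetical and not the course's attrition_num setup.

```r
library(tidymodels)

# Hypothetical employee data with a binary outcome
set.seed(123)
n <- 300
toy <- data.frame(
  Age = rnorm(n, mean = 37, sd = 9),
  DistanceFromHome = runif(n, min = 1, max = 30)
)
p_leave <- plogis(-1 - 0.05 * (toy$Age - 37) + 0.05 * toy$DistanceFromHome)
toy$Attrition <- factor(ifelse(runif(n) < p_leave, "Yes", "No"),
                        levels = c("No", "Yes"))

toy_split <- initial_split(toy, prop = 3/4, strata = Attrition)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

toy_recipe <- recipe(Attrition ~ ., data = toy_train) %>%
  step_normalize(all_numeric_predictors())

toy_workflow <- workflow() %>%
  add_model(logistic_reg()) %>%
  add_recipe(toy_recipe)

# Estimate the recipe and the model on the training data only
toy_fit <- fit(toy_workflow, data = toy_train)

# augment() keeps the test columns and adds the predictions
toy_aug <- augment(toy_fit, new_data = toy_test)
print(names(toy_aug))
```

The augmented data frame gains .pred_class plus one probability column per outcome level (.pred_No and .pred_Yes), which are the columns the yardstick metrics consume.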

# Customize your model assessment
Creating custom assessment functions is quite convenient when iterating through various models. The metric_set() function from the yardstick package can help you achieve this.

Define a function that returns roc_auc, accuracy, sens (sensitivity), and spec (specificity), and use it to assess your model.

The augmented data frame lr_aug is already loaded and ready to go.

Instructions
Define a custom assessment function that returns roc_auc, accuracy, sens, and spec.
Assess your model using your new function on lr_aug to obtain the metrics you just chose.

In [None]:
# Define a custom assessment function
class_evaluate <- metric_set(roc_auc, accuracy, sens, spec)

# Assess your model using your new function
class_evaluate(lr_aug, truth = Attrition,
               estimate = .pred_class,
               .pred_No)
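
The same call pattern can be tried on a small hand-made data frame whose metrics are easy to verify. This sketch assumes the yardstick package is installed; `toy_aug` and its eight rows are hypothetical, and the 0.5 cutoff used to build .pred_class is an assumption.

```r
library(yardstick)

# Hypothetical augmented data: truth, probability of "No", and hard class
toy_aug <- data.frame(
  Attrition = factor(c("No", "No", "No", "No", "No", "Yes", "Yes", "Yes"),
                     levels = c("No", "Yes")),
  .pred_No = c(0.95, 0.85, 0.70, 0.55, 0.35, 0.60, 0.30, 0.10)
)
toy_aug$.pred_class <- factor(ifelse(toy_aug$.pred_No >= 0.5, "No", "Yes"),
                              levels = c("No", "Yes"))

# Class metrics use estimate; roc_auc uses the probability column
class_evaluate <- metric_set(roc_auc, accuracy, sens, spec)
toy_metrics <- class_evaluate(toy_aug, truth = Attrition,
                              estimate = .pred_class, .pred_No)
print(toy_metrics)
```

With "No" as the event level, four of the five "No" cases are predicted correctly (sens = 0.8), two of the three "Yes" cases are predicted correctly (spec = 2/3), six of eight predictions are right overall (accuracy = 0.75), and 13 of the 15 event/non-event pairs are ranked correctly (roc_auc = 13/15).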