Skip to content
Permalink
Branch: master
Find file Copy path
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
74 lines (53 sloc) 2.39 KB
---
title: "3 - Data cleaning and variable transformation"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{3 - Data cleaning and variable transformation}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
## Introduction
In _bllflow_ data cleaning is loosely defined as modifying rows (observations) and variable transformation as modifying columns (variables). Data preparation refers to both these procedures. Data preparation functions come in two forms:
1) Functions where the data and variables are specfified as attributes within the function. _bllflow_ add to the functions that already exist in the `sjmisc` package.
1) _bllflow_ object functions where the Model Specification Workbook specifies which data is cleaned and variables transformed. These _bllflow_ 'wrapper' functions are available for selected functions in the `sjmisc` package and for all _bllflow_ objects.
The effects of all data preparation functions can be logged and printed. Tranformations can be save as a Predictive Modelling Markup Language (PMML) file to faciliate predictive model deployment.
## Data cleaning
Three data cleaning functions are supported.
For continuous variables:
1) `clean.Min()` -- deletes observations below a defined value.
1) `clean.Max()` -- deletes observations abovea a defined value.
For categorical variables:
1) `recodeWithTable()` TODO.
### Example 1: `clean.Max` and `clean.Min` using BLLFlow model (and MSW variableDetails)
```{r}
# load libraries and pbc data (from survival)
library(survival)
data(pbc)
library(bllflow)
# read the Model Specification Workbook
variables <- read.csv(file.path(getwd(), '../inst/extdata/PBC-variables.csv'))
variableDetails <- read.csv(file.path(getwd(), '../inst/extdata/PBC-variableDetails.csv'))
# initialize BLLFlow - create the BLLFlow object
pbcModel <- BLLFlow(pbc, variables, variableDetails)
cleanedPbcModel <- clean.Max(pbcModel, print = TRUE)
cleanedPbcModel <- clean.Min(cleanedPbcModel, print = TRUE)
cleanedPbcModel$metaData$log # to print the entire log.
```
## Data tranformation
Supported functions: <font color="red">TODO</font>: describe these functions.
- centring
- normalization and standardization
- restrictive cubic splines
- interaction terms
- dummy variables
- recoding
## Logging
TODO
## Transformations to PMML
TODO
You can’t perform that action at this time.