---
output: github_document
---
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# dqtesting
The goal of the package is to provide an easy-to-use toolset for data quality (DQ) testing. The main function `perform_dqtest` returns a list containing various results from a univariate DQ test. In addition, the package provides functions that offer a very simple interface to the Local Outlier Factor (LOF) algorithm for multivariate outlier detection, including automated hyperparameter tuning.
## Web Interface
A web application for the package is available: [Link](https://esommer.shinyapps.io/dqtesting/)
## Installation
You can install the released version of dqtesting from GitHub in R with the following line of code:
``` {r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("EmanuelSommer/dqtesting")
```
## Some examples
`dummy_data` is a dummy data set included in the package.
```{r data}
library(dqtesting)
# quick overview of the dummy data
str(dummy_data)
```
The following basic examples show how to check some common data quality requirements.
Suppose the data should not contain missing values, should lie within a certain range or contain only certain categories, and some values should be excluded from the checks because they represent special cases. The function `perform_dqtest` can check these requirements as follows.
```{r example}
library(dqtesting)
### first variable: char1 with allowed categories "Random Forest", "Linear Regression" and "SVM", "Neural Networks" should be excluded.
dq_char <- perform_dqtest(dummy_data$char1,
                          categories = c("Random Forest", "Linear Regression", "SVM"),
                          exclude_values = "Neural Networks")
# access the absolute and relative number of NAs
dq_char$abs_na
dq_char$rel_na
# access the absolute and relative number of excluded values
dq_char$exclusions$abs_excluded
dq_char$exclusions$rel_excluded
# access a statistical summary
dq_char$stat_summary
# access the categories check
dq_char$cat_check
# access a visualisation
dq_char$barplot
### second variable: num2 should be non negative
dq_num <- perform_dqtest(dummy_data$num2, range_min = 0)
# access the 0.3 quantile of the vector (assuming quantiles11 holds the 0, 0.1, ..., 1 quantiles, so that [1] is the minimum)
dq_num$stat_summary$quantiles11[4]
# access the minimum of the vector
dq_num$stat_summary$quantiles11[1]
# access the range check
dq_num$range_check
# access a different visualisation
dq_num$boxplot
### third variable: datetime2
dq_datetime <- perform_dqtest(dummy_data$datetime2)
# access a visualisation about the weekdays
dq_datetime$hist_wday
```
These were just a few examples of this very flexible function.
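
For intuition, the kinds of requirements listed above (missing values, excluded special values, a lower bound or a set of allowed categories) can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `basic_checks` and the names of its result fields are hypothetical and are not part of dqtesting, and the package may define its results differently.

```r
# Illustrative base-R sketch; `basic_checks` is a hypothetical helper,
# not a function of the dqtesting package.
basic_checks <- function(x, categories = NULL, range_min = -Inf,
                         range_max = Inf, exclude_values = NULL) {
  n_total <- length(x)
  is_missing <- is.na(x)
  # excluded values are special cases that are left out of the later checks
  is_excluded <- !is_missing & x %in% exclude_values
  kept <- x[!is_missing & !is_excluded]

  result <- list(
    n_missing = sum(is_missing),
    share_missing = sum(is_missing) / n_total,
    n_excluded = sum(is_excluded),
    share_excluded = sum(is_excluded) / n_total
  )
  if (is.numeric(kept)) {
    # TRUE only if every remaining value lies inside the allowed range
    result$range_ok <- all(kept >= range_min & kept <= range_max)
  } else if (!is.null(categories)) {
    # remaining values that are not among the allowed categories
    result$unexpected <- setdiff(unique(kept), categories)
  }
  result
}

models <- c("SVM", "Random Forest", NA, "Neural Networks", "Lasso")
res_chr <- basic_checks(models,
                        categories = c("Random Forest", "Linear Regression", "SVM"),
                        exclude_values = "Neural Networks")
res_num <- basic_checks(c(0.5, 2, -1, NA), range_min = 0)
```

Here `res_chr$unexpected` lists the values that are neither allowed nor excluded, and `res_num$range_ok` reports whether all non-missing values respect the lower bound.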
To detect multivariate outliers as well, the package provides some useful functions:
```{r example2}
# perform the LOF algorithm
lof_list <- lof_fun(dummy_data[, c(3, 4)])
# extract the most suspicious values
extract_rows_score(dummy_data[, c(3, 4)], lof_list, threshold = 1.5)
# visualize the results
lof_vis(dummy_data[, c(3, 4)], lof_list)
```
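
To give a rough idea of what a Local Outlier Factor score measures, the textbook definition can be sketched in base R. This is a minimal sketch under stated assumptions, not the implementation used by dqtesting: `lof_sketch` is a hypothetical helper, the neighbourhood size `k` is fixed by hand rather than tuned automatically, exactly `k` neighbours are used (ties are broken by position), and duplicated points are not handled.

```r
# Illustrative base-R sketch of the LOF score; `lof_sketch` is hypothetical
# and not part of dqtesting. Assumes numeric data without duplicated rows.
lof_sketch <- function(x, k = 3) {
  d <- as.matrix(dist(x))            # pairwise Euclidean distances
  n <- nrow(d)
  stopifnot(k >= 1, k < n)
  diag(d) <- Inf                     # a point is not its own neighbour
  # row i holds the indices of the k nearest neighbours of point i
  nn <- matrix(apply(d, 1, function(row) order(row)[seq_len(k)]),
               nrow = n, ncol = k, byrow = TRUE)
  # k-distance: distance from each point to its k-th nearest neighbour
  k_dist <- d[cbind(seq_len(n), nn[, k])]
  # local reachability density: inverse of the mean reachability distance
  lrd <- vapply(seq_len(n), function(i) {
    reach <- pmax(k_dist[nn[i, ]], d[i, nn[i, ]])
    1 / mean(reach)
  }, numeric(1))
  # LOF: mean density of the neighbours relative to the point's own density
  vapply(seq_len(n), function(i) mean(lrd[nn[i, ]]) / lrd[i], numeric(1))
}

# five points in a tight cluster plus one isolated point (row 6)
pts <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1), c(0.5, 0.5), c(5, 5))
scores <- lof_sketch(pts, k = 3)
which.max(scores)  # row index of the most outlying point
```

Scores close to 1 indicate a point whose local density matches that of its neighbours, while clearly larger scores indicate a point that is isolated relative to them, which is why a cut-off such as the `threshold` used above can flag suspicious rows.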
Detailed help pages are available for all functions. For example, type `?perform_dqtest` into the console.
Have fun :)