# Background
Exploratory Data Analysis (EDA) is the initial, and an important, phase of data analysis and predictive modeling. During this process, analysts and modelers take a first look at the data, generate relevant hypotheses, and decide on next steps. However, the EDA process can be a hassle at times. The DataExplorer R package aims to automate most of the data handling and visualization, so that users can focus on studying the data and extracting insights.

# Installation

The package can be installed directly from CRAN.

In [None]:
install.packages("DataExplorer")

# Examples

The package is easy to use: almost everything can be done in one line of code. Please refer to the package manuals for more information. The package vignettes contain further examples.


DataExplorer has three main goals:

- Exploratory Data Analysis (EDA)
- Feature Engineering
- Data Reporting

The examples use the datasets from the nycflights13 package:

In [None]:
install.packages("nycflights13")
library(nycflights13)

There are 5 datasets in this package:

- airlines
- airports
- flights
- planes
- weather

To quickly visualize the structure of all five datasets, you can do the following:

In [None]:
library(DataExplorer)
data_list <- list(airlines, airports, flights, planes, weather)
plot_str(data_list)
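
The remaining examples work on a single merged data frame. A minimal sketch of how such a data frame might be assembled with base R's merge, assuming left joins on carrier, tailnum, origin and dest. The intermediate object names and the suffixes are assumptions, chosen so that the resulting column names (for example year_planes, dst_dest and tzone_dest) match those used in the later examples:

```r
library(nycflights13)

# Left-join each lookup table onto flights; all.x = TRUE keeps every flight
# even when the lookup table has no matching row.
merge_airlines <- merge(flights, airlines, by = "carrier", all.x = TRUE)

# Both flights and planes have a "year" column, so suffixes disambiguate them.
merge_planes <- merge(merge_airlines, planes, by = "tailnum", all.x = TRUE,
                      suffixes = c("_flights", "_planes"))

# Join airport details twice: once for the origin, once for the destination.
merge_airports_origin <- merge(merge_planes, airports,
                               by.x = "origin", by.y = "faa", all.x = TRUE,
                               suffixes = c("_carrier", "_origin"))
final_data <- merge(merge_airports_origin, airports,
                    by.x = "dest", by.y = "faa", all.x = TRUE,
                    suffixes = c("_origin", "_dest"))
```

Because the join keys are unique in each lookup table, the merged data frame keeps one row per flight.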

# Exploratory Data Analysis

Exploratory data analysis is the process of getting to know your data so that you can generate and test hypotheses. Visualization techniques are usually applied.

To get introduced to your newly created dataset, assumed here to be named final_data:

In [None]:
introduce(final_data)

# Missing values

Real-world data is messy. The plot_missing function visualizes the missing-value profile of each feature.

In [None]:
plot_missing(final_data)
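
When the numbers behind such a plot are needed, the share of missing values per column can also be computed directly. A minimal base R sketch on a small made-up data frame (toy_df and its columns are hypothetical, not part of the package):

```r
# Hypothetical toy data with some missing values
toy_df <- data.frame(
  delay = c(5, NA, 12, NA),
  carrier = c("AA", "UA", NA, "DL"),
  distance = c(100, 200, 300, 400)
)

# Fraction of missing values per column, sorted from most to least missing
pct_missing <- sort(colMeans(is.na(toy_df)), decreasing = TRUE)
print(pct_missing)
```

The same expression can be applied to final_data to rank its features by missingness.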

# Bar Charts

To visualize frequency distributions for all discrete features:

In [None]:
plot_bar(final_data)

# Histograms

To visualize distributions for all continuous features:

In [None]:
plot_histogram(final_data)

# Correlation Analysis

To visualize a correlation heatmap for all features, after dropping rows that contain missing values:

In [None]:
plot_correlation(na.omit(final_data), maxcat = 5L)
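
Dropping every row that contains a missing value can discard a lot of data. As an alternative sketch, assuming the type and cor_args arguments of plot_correlation, the heatmap can be restricted to continuous features and computed from pairwise complete observations. The built-in airquality dataset is used here only for illustration:

```r
library(DataExplorer)

# airquality ships with R and has missing values in Ozone and Solar.R.
# type = "continuous" limits the heatmap to numeric features;
# cor_args is passed on to stats::cor, here to use pairwise complete cases.
p <- plot_correlation(
  airquality,
  type = "continuous",
  cor_args = list(use = "pairwise.complete.obs")
)
```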

# Boxplots

Suppose you would like to build a model to predict arrival delays. You can visualize the distribution of all continuous features against arrival delays with boxplots:

In [None]:
## Reduce data size for demo purpose
arr_delay_df <- final_data[, c("arr_delay", "month", "day", "hour", "minute", "dep_delay", "distance", "year_planes", "seats")]

## Call boxplot function
plot_boxplot(arr_delay_df, by = "arr_delay")

# Scatterplots

An alternative visualization is a scatterplot. For example:

In [None]:
arr_delay_df2 <- final_data[, c("arr_delay", "dep_time", "dep_delay", "arr_time", "air_time", "distance", "year_planes", "seats")]

plot_scatterplot(arr_delay_df2, by = "arr_delay", sampled_rows = 1000L)

# Replace missing values

Missing values may carry meaning for a feature.

Besides imputation, we may also set them to sensible values. For example, for discrete features, we may want to group missing values into a new category. For continuous features, we may want to set missing values to a known number based on existing knowledge.

In DataExplorer, this can be done with set_missing. The function automatically matches the argument to either discrete or continuous features: if you specify a number, all missing continuous values will be set to that number. If you specify a string, all missing discrete values will be set to that string. If you supply both, both types will be set.

In [None]:
## Return data.frame
final_df <- set_missing(final_data, list(0L, "unknown"))
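
To see the effect on a small scale, here is a sketch on a made-up data frame (toy_df and its columns are hypothetical). It assumes the same call pattern as above, with a number for continuous features and a string for discrete ones:

```r
library(DataExplorer)

# Hypothetical toy data: one continuous and one discrete column, each with an NA
toy_df <- data.frame(
  delay = c(5, NA, 12),
  carrier = c("AA", NA, "DL"),
  stringsAsFactors = FALSE
)

# The number fills the continuous column, the string fills the discrete one
filled_df <- set_missing(toy_df, list(0L, "unknown"))
print(filled_df)
```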

# Dummify data (one hot encoding)

To transform the data into binary format (so that ML algorithms can pick it up), use dummify. The function preserves the original data structure, so only eligible discrete features are turned into binary format.

In [None]:
plot_str(
  list(
    "original" = final_data,
    "dummified" = dummify(final_data, maxcat = 5L)
  )
)
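
For a more compact view of what dummify produces, here is a sketch on a made-up data frame (toy_df and its columns are hypothetical). The discrete column is expected to be replaced by one binary indicator column per category, while the continuous column is kept:

```r
library(DataExplorer)

# Hypothetical toy data: one continuous and one discrete feature
toy_df <- data.frame(
  distance = c(100, 200, 300),
  origin = c("JFK", "LGA", "JFK"),
  stringsAsFactors = FALSE
)

# One-hot encode the discrete feature; the continuous feature is left as is
dummy_df <- dummify(toy_df)
str(dummy_df)
```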

# Drop features

After viewing the feature distributions, you often want to drop features that are insignificant. For example, features like dst_dest have mostly one value and provide little useful information. You can use drop_columns to quickly drop features. The function takes either names or column indices.

In [None]:
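## Column positions depend on how final_data was built,
## so look them up by name instead of hard-coding them
drop_idx <- match(c("dst_dest", "tzone_dest"), names(final_data))
identical(
  drop_columns(final_data, c("dst_dest", "tzone_dest")),
  drop_columns(final_data, drop_idx)
)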