# Exploratory data analysis (EDA)

This notebook contains examples of EDA as well as exercises that will allow you to do some EDA on your own.

# EDA step 1 – the basics

As an example, we will look at the diamonds dataset. The `diamonds` dataset ships with the `ggplot2` package, which is loaded as part of the `tidyverse`, which we will need anyway.

In [None]:
library(tidyverse)


Here is a first look at the data:

In [None]:
head(diamonds, 10)


In [None]:
# ?diamonds
str(diamonds)


### *Exercise*

- What do the different variables in the dataset represent?
- What does a row represent?
- Which variables are categorical and which ones are numerical?
- What are the different data types of the different variables?
- Should we change the data type of any of the variables?
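
One way to approach the questions about data types is to ask R for the class of every column. The following is only a minimal sketch on a small made-up data frame called `toy` (its values are illustrative, not the real dataset); the same calls can be tried on `diamonds`.

```r
# Toy data frame with diamonds-style columns (illustrative values only)
toy <- data.frame(
  carat = c(0.23, 0.21, 0.29),
  cut   = factor(c("Ideal", "Premium", "Good"),
                 levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
                 ordered = TRUE),
  price = c(326L, 326L, 334L)
)

# First class of each column; an ordered factor reports "ordered" first
col_classes <- sapply(toy, function(col) class(col)[1])
print(col_classes)

# Logical flags: which columns are numerical, which are categorical (factors)
is_numeric_col <- sapply(toy, is.numeric)
is_factor_col  <- sapply(toy, is.factor)
print(is_numeric_col)
print(is_factor_col)
```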

# Next step: Asking questions about your data

## Visualizing variation within a variable

We will first plot the distribution of the categorical variable `cut` using a bar plot. (Throughout this notebook we will use the standard plotting functionality of R. In the next notebook, however, we will see how to make much prettier and more flexible plots with the ggplot2 package.)

In [None]:
summary(diamonds$cut)
barplot(summary(diamonds$cut))


Alternatively, you can just use the `plot` function on the variable:

In [None]:
plot(diamonds$cut)


We can resize the plots in our notebook if we think they are too big or too small by setting some global options:

In [None]:
options(repr.plot.width=8, repr.plot.height=5)


In [None]:
plot(diamonds$cut)


The interpretation of this plot is fairly straightforward. For each of the possible cuts in the dataset, it shows how many diamonds have that cut.

Now let us plot the distribution of the numerical variable `carat`. First, let us use a histogram:

In [None]:
hist(diamonds$carat)


Here the interpretation of the plot requires a little more. The values of the numerical variable `carat` vary from about 0.2 to a little above 5. To get an idea of the distribution of the carat of the different diamonds, we group them into bins of equal width. With a bin width of 0.5 carats, for example, all diamonds with a carat from 0 to 0.5 are put in the first bin, diamonds with a carat from 0.5 to 1 are put in the second bin, and so on. The histogram then shows how many diamonds are in each bin.
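
The binning described above can be made explicit. As a minimal sketch, assuming a small made-up vector `carat_toy` (not the real data), we pass exact break points to `hist` and ask it to return the bin counts instead of drawing the plot:

```r
# Made-up carat values, for illustration only
carat_toy <- c(0.23, 0.31, 0.40, 0.52, 0.70, 0.71, 0.90, 1.01, 1.20, 1.50, 2.01)

# Explicit bin edges with a width of 0.5 carats
breaks_half <- seq(0, 2.5, by = 0.5)

# plot = FALSE returns the histogram object without drawing it
h <- hist(carat_toy, breaks = breaks_half, plot = FALSE)
print(h$breaks)  # the bin edges that were used
print(h$counts)  # number of values in each bin

# The same counts via cut() and table(); bins are (0,0.5], (0.5,1], ...
bin_table <- table(cut(carat_toy, breaks = breaks_half))
print(bin_table)
```

By default each bin includes its right edge, so a value of exactly 1.5 is counted in the bin from 1 to 1.5.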

Note that the width of the bins is a choice. If we change the width of the bins, the plot will look different, and there is no single right answer to the question of how many bins there should be. Let us try it out:

In [None]:
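# A single number for `breaks` is only a suggestion; R picks "pretty" break points near it
hist(diamonds$carat, breaks = 4)
hist(diamonds$carat, breaks = 20)
# Zoom in on the diamonds of at most 1 carat
hist(filter(diamonds, carat <= 1)$carat, breaks = 10)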

We can also draw a box plot of a numerical variable in the following way:

In [None]:
boxplot(diamonds$carat)

A box plot takes a bit more effort to interpret. First, the thick horizontal line in the middle of the box represents the `median`. The lower horizontal line of the box is the 1st quartile and the upper horizontal line of the box is the 3rd quartile.

The *interquartile range* `IQR` is the 3rd quartile minus the 1st quartile (i.e. the height of the box). The top horizontal line (the upper whisker) is drawn at the largest value that is no greater than "*the 3rd quartile + 1.5 times IQR*"; if no value lies above that limit, it is simply *the maximum value*. The bottom horizontal line is drawn at the smallest value that is no less than "*1st quartile - 1.5 times IQR*", which is *the minimum value* if nothing lies below that limit.

Every value above the top horizontal line or below the bottom horizontal line is displayed as a circle and is referred to as an *outlier*.
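
The numbers behind a box plot can be inspected with `boxplot.stats`. The sketch below uses a small made-up vector `x` (an assumption for illustration) and then recomputes the whiskers and outliers by hand from the box edges. Note that R's box plot uses so-called hinges for the box edges, which can differ slightly from the quartiles returned by `quantile`.

```r
# Made-up values with one clearly extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

bs <- boxplot.stats(x)  # uses a factor of 1.5 by default
print(bs$stats)         # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)           # values drawn as circles

# Recompute the whiskers and outliers by hand from the box edges
q1 <- bs$stats[2]
q3 <- bs$stats[4]
box_height  <- q3 - q1
upper_fence <- q3 + 1.5 * box_height
lower_fence <- q1 - 1.5 * box_height

upper_whisker <- max(x[x <= upper_fence])  # largest value inside the fence
lower_whisker <- min(x[x >= lower_fence])  # smallest value inside the fence
outliers      <- x[x > upper_fence | x < lower_fence]
```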

### *Exercise*

Plot the distributions of the variables `color` and `price`.

## Quantifying variation within a variable

Let us now *quantify* the variation within a variable, such as the variables that we saw in the above plots.

### Categorical variables

First, let us quantify what we saw in the bar plot for the categorical variable `cut`:

In [None]:
summary(diamonds$cut)

In [None]:
table(diamonds$cut)

As we see, the `summary` or `table` function can be used to give us the number of cases (diamonds) in each of the categories of cut.
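
Relative frequencies are often easier to compare than raw counts. As a minimal sketch, assuming a small made-up factor `cuts_toy` (not the real data), `prop.table` turns the counts from `table` into proportions:

```r
# Made-up categorical values, for illustration only
cuts_toy <- factor(c("Good", "Ideal", "Ideal", "Fair", "Ideal", "Good"),
                   levels = c("Fair", "Good", "Ideal"))

counts <- table(cuts_toy)       # number of cases per category
props  <- prop.table(counts)    # share of cases per category (sums to 1)

print(counts)
print(round(100 * props, 1))    # percentages
```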

Let us now try to calculate the *mode* of the variable `cut`:

In [None]:
mode(diamonds$cut)

As you can see, this is clearly not the mode of the `cut` variable. In R, the function `mode` does something completely different (it reports how an object is stored internally), so let us forget about that. Instead, we can use the `max` function on `table(diamonds$cut)` to get the number of times the most frequent value occurs. However, the mode is the most frequent value itself, which we can find with the `which.max` function:

In [None]:
max(table(diamonds$cut))
which.max(table(diamonds$cut))

So the *mode* of the variable `cut` is `Ideal`.
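
Since `which.max` returns a position (with the category as its name) and only reports the first maximum, it can be convenient to wrap the idea in a small helper. The function `stat_mode` below is a hypothetical name introduced for this sketch, not a built-in R function; it returns the label of the most frequent value, or several labels if there is a tie.

```r
# Hypothetical helper: label(s) of the most frequent value(s)
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

# Made-up data, for illustration only
cuts_toy <- factor(c("Good", "Ideal", "Ideal", "Fair", "Ideal", "Good"),
                   levels = c("Fair", "Good", "Ideal"))
single_mode <- stat_mode(cuts_toy)

# With a tie, both labels are returned
tied_mode <- stat_mode(c("a", "b", "a", "b", "c"))

print(single_mode)
print(tied_mode)
```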

### Numerical variables

For a numerical variable, such as `carat`, there are many different descriptive statistics we can use to quantify what we saw in the histogram. First, here is how to get the `mean` and the `median`:

In [None]:
mean(diamonds$carat)
median(diamonds$carat)

Quartiles can be obtained using the `summary` function (which reports the 1st, 2nd and 3rd quartiles), and arbitrary quantiles using the `quantile` function:

In [None]:
summary(diamonds$carat)

In [None]:
quantile(diamonds$carat, c(0.25, 0.35, 0.5, 0.75))

Note that the median and quantiles also make sense for categorical variables of *ordinal* type. However, in R we need to transform the categorical variable to a numeric variable first for the classic `median` and `quantile` functions to work:

In [None]:
median(as.numeric(diamonds$cut))
quantile(as.numeric(diamonds$cut), c(0.35, 0.5))
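
The numbers returned this way are level codes, not category names. As a minimal sketch, assuming a small made-up ordered factor `cut_toy`, the code can be mapped back to its label. Using `type = 1` in `quantile` is a choice made here so that the result is always one of the observed codes rather than an interpolated value.

```r
# Made-up ordered factor, for illustration only
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
cut_toy <- factor(c("Fair", "Good", "Very Good", "Very Good",
                    "Premium", "Ideal", "Ideal"),
                  levels = cut_levels, ordered = TRUE)

codes <- as.integer(cut_toy)  # 1 = Fair, ..., 5 = Ideal

# type = 1 never interpolates, so the result is an existing level code
median_code  <- quantile(codes, probs = 0.5, type = 1, names = FALSE)
median_label <- levels(cut_toy)[median_code]

print(median_code)
print(median_label)
```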

After looking at descriptive statistics for the central tendency of a distribution, let us look at descriptive statistics for its spread. First we can look at the minimum, the maximum, and the range in the following way:

In [None]:
min(diamonds$carat)
max(diamonds$carat)
range(diamonds$carat)
max(diamonds$carat) - min(diamonds$carat)

Now let us see how to calculate the variance in R:

In [None]:
var(diamonds$carat)
sum((diamonds$carat - mean(diamonds$carat))^2) /
(length(diamonds$carat) - 1)

The standard deviation can be calculated in the following way:

In [None]:
sd(diamonds$carat)
sqrt(var(diamonds$carat))
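
Written out as formulas, the two cells above compute the sample variance and the sample standard deviation. Here $n$ stands for the number of observations (`length(diamonds$carat)`), $x_i$ for the individual values and $\bar{x}$ for their mean; the symbols are notation introduced for this note.

$$
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2, \qquad
s = \sqrt{s^2}
$$

The division by $n - 1$ rather than $n$ matches the `length(diamonds$carat) - 1` in the manual calculation and is what `var` and `sd` use.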

### *Exercise*

Calculate all the above descriptive statistics for the variables `color` and `price`.

# Variation between two variables

We will first visualize the relationship between two categorical variables using a mosaic plot:

In [None]:
mosaicplot(diamonds$cut ~ diamonds$color)

The area of each rectangle is proportional to the number of cases with that combination of the two categorical variables. Compare this to the following descriptive statistics:

In [None]:
table(diamonds$color, diamonds$cut)
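
Raw counts in a two-way table are hard to compare when the groups differ in size. As a minimal sketch, assuming a small made-up data frame `toy`, `prop.table` with `margin = 1` gives the share of each cut within each color (each row sums to 1):

```r
# Made-up data, for illustration only
toy <- data.frame(
  color = c("D", "D", "D", "E", "E", "E", "E", "E"),
  cut   = c("Good", "Ideal", "Ideal", "Good", "Good", "Good", "Ideal", "Ideal")
)

counts    <- table(toy$color, toy$cut)      # rows: color, columns: cut
row_props <- prop.table(counts, margin = 1) # proportions within each row

print(counts)
print(round(row_props, 2))
```

Using `margin = 2` instead would give proportions within each column.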

Now, let us visualize the relationship between two numerical variables using a classic scatterplot:

In [None]:
plot(diamonds$carat, diamonds$price)

There is clearly some sort of relationship here. It does not look completely linear, but we can measure the linear relationship using Pearson’s correlation coefficient:

In [None]:
cor(diamonds$carat, diamonds$price)

The `cor` function can also compute other correlation coefficients; see its help page (`?cor`).
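
When a relationship is monotone but not linear, a rank-based coefficient can be more informative than Pearson's. The sketch below uses made-up vectors `x` and `y` (an assumption for illustration) and the `method` argument of `cor`:

```r
# Made-up data: y increases with x, but not along a straight line
x <- 1:10
y <- x^3

pearson  <- cor(x, y)                       # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures monotone association via ranks

print(pearson)
print(spearman)
```

Here the Spearman coefficient equals 1 because the ranks of `x` and `y` agree perfectly, while the Pearson coefficient is below 1.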

Finally, let us look at the relationship between a categorical variable and a numerical variable using a boxplot:

In [None]:
boxplot(diamonds$carat ~ diamonds$cut)

In this plot we can see how the numerical variable (`carat`) is distributed for each of the five values of the categorical variable `cut` and thereby compare them. Is there a significant difference in the distribution between two different values?
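
The comparison made visually in the box plot can also be put into numbers by computing a statistic per group. As a minimal sketch, assuming a small made-up data frame `toy`, `tapply` applies a function to the numerical values within each category:

```r
# Made-up data, for illustration only
toy <- data.frame(
  cut   = c("Fair", "Fair", "Fair", "Ideal", "Ideal", "Ideal"),
  carat = c(1.0, 1.2, 0.9, 0.3, 0.5, 0.4)
)

# Median and interquartile range of carat within each cut
group_medians <- tapply(toy$carat, toy$cut, median)
group_iqrs    <- tapply(toy$carat, toy$cut, IQR)

print(group_medians)
print(group_iqrs)
```

These are descriptive comparisons only; judging whether a difference is statistically significant would require a formal test.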

### *Exercise*

Try to plot and calculate descriptive statistics for relationships between other pairs of variables from the `diamonds` dataset.

### *Exercise*

Do an Exploratory Data Analysis on the dataset "LasVegasTripAdvisorReviews-Dataset.csv" which is on the JupyterHub. The data comes from the UCI Machine Learning Repository, see https://archive.ics.uci.edu/ml/datasets/Las+Vegas+Strip

In [None]:
dataSet <- read.csv2("LasVegasTripAdvisorReviews-Dataset.csv")

In [None]:
head(dataSet)
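
Note that `read.csv2` assumes that fields are separated by semicolons and that decimals are written with a comma. The sketch below illustrates this on a tiny made-up text input; the column names `hotel`, `score` and `price` are hypothetical and are not the columns of the real dataset.

```r
# Tiny made-up input in "semicolon and decimal comma" format
toy_csv <- "hotel;score;price\nA;4;1,5\nB;5;2,5"

toy <- read.csv2(text = toy_csv)

str(toy)      # column types as R read them
summary(toy)  # quick per-column overview
```

The same `str` and `summary` calls are a reasonable first step on `dataSet`.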