*Analytical Information Systems*

# Descriptive Analytics in R - Baseball Salaries

Prof. Christoph M. Flath<br>
Lehrstuhl für Wirtschaftsinformatik und Informationsmanagement

SS 2019

<h1>Agenda<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Download-and-preprocess-data" data-toc-modified-id="Download-and-preprocess-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Download and preprocess data</a></span></li><li><span><a href="#Descriptive-Statistics" data-toc-modified-id="Descriptive-Statistics-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Descriptive Statistics</a></span><ul class="toc-item"><li><span><a href="#Central-Tendency" data-toc-modified-id="Central-Tendency-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Central Tendency</a></span></li><li><span><a href="#Variability" data-toc-modified-id="Variability-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Variability</a></span></li><li><span><a href="#Shape" data-toc-modified-id="Shape-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Shape</a></span></li></ul></li><li><span><a href="#Visualization" data-toc-modified-id="Visualization-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Visualization</a></span><ul class="toc-item"><li><span><a href="#Histogram" data-toc-modified-id="Histogram-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Histogram</a></span></li><li><span><a href="#Faceting" data-toc-modified-id="Faceting-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Faceting</a></span></li><li><span><a href="#Dot-Plot-Histogram" data-toc-modified-id="Dot-Plot-Histogram-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Dot Plot Histogram</a></span></li><li><span><a href="#Boxplot" data-toc-modified-id="Boxplot-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Boxplot</a></span></li></ul></li></ul></div>

## Download and preprocess data

In [None]:
library(tidyverse)
file_url <- "https://www.dropbox.com/s/ysd0zljicq5yqfo/baseball.csv?dl=1"

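file_url %>%
    read_csv2() %>%                                        # read_csv2() expects ';' as delimiter and ',' as decimal mark
    mutate(Salary = str_replace_all(Salary,"\\$","")) %>%  # drop the dollar sign
    mutate(Salary = str_replace_all(Salary,",","")) %>%    # drop the thousands separators
    mutate(Salary = as.numeric(Salary) / 1000000) -> salaries  # convert to millions of dollars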
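The cleaning step can be checked on a single value before it is applied to the whole column. The sketch below uses base R's `gsub()` as a stand-in for the two `str_replace_all()` calls; the helper name `clean_salary` and the sample string are made up for illustration.

```r
# Hypothetical helper: strip "$" and "," from a salary string and
# express the result in millions of dollars.
clean_salary <- function(x) {
  digits_only <- gsub("[$,]", "", x)   # remove dollar signs and thousands separators
  as.numeric(digits_only) / 1000000    # convert to numeric, then scale to millions
}

clean_salary("$1,250,000")             # made-up example value
```

Running the helper on a few raw strings is a cheap way to confirm that no `NA` values are introduced by the conversion.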

Take a quick look at the data.

In [None]:
glimpse(salaries)

## Descriptive Statistics

### Central Tendency

In [None]:
salaries %>%
  summarise(mean=mean(Salary),
            median=median(Salary))

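Base R has no built-in function for the statistical mode (`mode()` returns an object's storage type), so count how often each salary occurs and list the most frequent values: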

In [None]:
salaries %>%
  group_by(Salary) %>%
  summarize(count = n()) %>%
  arrange(-count) %>%
  head(5)
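If the mode is needed repeatedly, the counting logic can be wrapped in a small function. This is a minimal sketch in base R; `stat_mode` and `toy_salaries` are hypothetical names and the values are invented. The function returns all tied values rather than picking one.

```r
# Hypothetical helper: return the most frequent value(s) of a numeric vector.
stat_mode <- function(x) {
  counts <- table(x)                              # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]   # keep every value with the top count
  as.numeric(modes)                               # table labels are strings, convert back
}

toy_salaries <- c(0.5, 0.5, 0.5, 1.2, 3.0, 3.0, 10.0)  # made-up salaries in $ millions
stat_mode(toy_salaries)
```

Because the values pass through character labels, this approach is best suited to rounded figures such as salaries, not to arbitrary floating-point data.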

### Variability

In [None]:
salaries %>%
  summarise(range=max(Salary)-min(Salary),
            var=var(Salary),
            CoV=sd(Salary)/mean(Salary))
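To make the definitions explicit, the sample variance and the coefficient of variation (CoV) can be recomputed by hand. The sketch below assumes a small made-up vector `toy`; note that `var()` and `sd()` divide by `n - 1`, not `n`.

```r
toy <- c(0.5, 1.2, 3.0, 10.0)   # made-up salaries in $ millions
n <- length(toy)

# Sample variance: squared deviations from the mean, divided by n - 1
manual_var <- sum((toy - mean(toy))^2) / (n - 1)

# Coefficient of variation: standard deviation relative to the mean (unitless)
manual_cov <- sqrt(manual_var) / mean(toy)

all.equal(manual_var, var(toy))             # matches the built-in variance
all.equal(manual_cov, sd(toy) / mean(toy))  # matches the expression used above
```

Since the CoV is unitless, it can be compared across groups with very different salary levels, which the raw variance cannot.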

Tukey's five number summary (minimum, lower-hinge, median, upper-hinge, maximum)

In [None]:
fivenum(salaries$Salary)

The `summary()` function reports the minimum, quartiles, median, mean, and maximum:

In [None]:
summary(salaries$Salary)
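The hinges returned by `fivenum()` and the quartiles returned by `summary()` (which relies on `quantile()` with its default type 7) are computed differently and need not coincide. A small illustration on the made-up vector `1:10`:

```r
toy <- 1:10  # made-up values

hinges <- fivenum(toy)[c(2, 4)]                                # Tukey's lower and upper hinge
quartiles <- unname(quantile(toy, probs = c(0.25, 0.75)))      # default (type 7) quartiles

hinges      # 3 and 8
quartiles   # 3.25 and 7.75
```

For large samples the difference is usually negligible, but it explains why the two outputs may not match to the last digit.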

#### These figures are not meaningful without comparisons, so compare at team level

- range

In [None]:
salaries %>%
  group_by(Team) %>%
  summarize(range = diff(range(Salary))) %>%
  arrange(range)

- coefficient of variation

In [None]:
salaries %>%
  group_by(Team) %>%
  summarize(CoV = sd(Salary)/mean(Salary)) %>%
  arrange(CoV)

### Shape

In [None]:
salaries %>%
  summarise(skew=psych::skew(Salary),
            kurt=psych::kurtosi(Salary))
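Skewness and excess kurtosis are based on the third and fourth central moments. The sketch below implements the plain moment-based definitions in base R; the function names and the toy vector are hypothetical, and the results can differ slightly from `psych::skew()` and `psych::kurtosi()` depending on the estimator selected through their `type` argument.

```r
# Hypothetical helper: k-th central moment of x
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness: m3 / m2^(3/2); positive values indicate a long right tail
moment_skew <- function(x) central_moment(x, 3) / central_moment(x, 2)^1.5

# Moment-based excess kurtosis: m4 / m2^2 - 3; zero for a normal distribution
moment_kurt <- function(x) central_moment(x, 4) / central_moment(x, 2)^2 - 3

toy <- c(0.5, 0.5, 0.8, 1.2, 3.0, 10.0)  # made-up, right-skewed salaries in $ millions
moment_skew(toy)
moment_kurt(toy)
```

A few very high salaries pull the mean above the median, which shows up as positive skewness.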

In [None]:
salaries %>%
  group_by(Team) %>%
  summarize(skew = psych::skew(Salary)) %>%
  arrange(-skew)

In [None]:
salaries %>%
  group_by(Team) %>%
  summarize(skew = psych::skew(Salary)) %>%
  arrange(skew)

In [None]:
salaries %>%
  group_by(Team) %>%
  summarize(kurt = psych::kurtosi(Salary)) %>%
  arrange(-kurt)

In [None]:
salaries %>%
  group_by(Team) %>%
  summarize(kurt = psych::kurtosi(Salary)) %>%
  arrange(kurt)

## Visualization

### Histogram

In [None]:
# Set the plot size for notebook output (repr options are used by the IRkernel)
options(repr.plot.width=4, repr.plot.height=4)
salaries %>%
  ggplot(aes(x=Salary)) +
  geom_histogram(fill="white", color="black")

__Log scaled histogram__

In [None]:
salaries %>%
  ggplot(aes(x=Salary)) +
  geom_histogram(fill="white", color="black") +
  scale_x_log10()

### Faceting
__Small multiples by Team with log scale__

In [None]:
salaries %>%
  ggplot(aes(x=Salary)) +
  geom_histogram(fill="white", color="black") +
  facet_wrap(~Team) +
  scale_x_log10()

__Small multiples by Position with log scale__

In [None]:
salaries %>%
  ggplot(aes(x=Salary)) + 
  geom_histogram(fill="white", color="black") + 
  facet_wrap(~Position) + 
  scale_x_log10()

### Dot Plot Histogram

In [None]:
salaries %>%
  ggplot(aes(x=Salary)) + 
  geom_dotplot(binwidth = 0.5)

### Boxplot

In [None]:
salaries %>%
  ggplot(aes(x=Position,y=Salary)) + 
  xlab("") + 
  geom_boxplot() + 
  coord_flip()
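The numbers behind a boxplot can be inspected with base R's `boxplot.stats()`, which returns the whisker ends, hinges, and median, plus the points beyond 1.5 times the hinge spread. The vector below is made up for illustration; `geom_boxplot()` follows the same 1.5 × IQR rule for whiskers, although its hinges are computed as quantiles and may differ slightly.

```r
toy <- c(0.5, 0.6, 0.8, 1.0, 1.2, 1.5, 12.0)  # made-up salaries in $ millions

bp <- boxplot.stats(toy)   # default whisker coefficient is 1.5

bp$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out     # values drawn as individual outlier points
```

Here the single very large value lies beyond the upper whisker and would be drawn as a separate point.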