---
output: github_document
editor_options: 
  chunk_output_type: console
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "75%",
  warning = FALSE,
  message = FALSE,
  fig.retina = 2,
  fig.align = 'center'
)
library(tidyverse)
library(pitchR)
```


# pitchR  <img src= "https://github.com/Reed-Math241/pkgGrpq/blob/master/figs/IMG_0175.png" align="right" width=175 />


<!-- badges: start -->
<!-- badges: end -->


The goal of `pitchR` is to provide an accessible dataset of advanced pitching statistics and salary data for individual starting pitchers from the 2018-2020 regular seasons. The data is tidy and ready for modeling with some of baseball's newest advanced statistics ⚾.


```{r hist, fig.width=14, fig.height=8, echo=FALSE}
pitchR %>% 
  ggplot(aes(x = woba, fill = factor(Year))) +
  geom_density(position = "identity", alpha = 0.5, color = NA) +
  scale_fill_manual(values = c("grey65", "midnightblue", "cyan4")) +
  theme_minimal() +
  labs(
    title = "Density distributions of wOBA by year",
    fill = "Year"
  ) +
  theme(
    plot.title.position = "plot",
    plot.subtitle = element_text(margin = margin(10, 0, 20, 0), size = 19),
    plot.title = element_text(size = 25, margin = margin(10, 0, 20, 0), face = "bold"),
    axis.text = element_text(size = 19),
    axis.title = element_text(size = 19, margin = margin(10, 10, 16, 10)),
    legend.key.size = unit(2, "cm"),
    legend.text = element_text(size = 15),
    legend.title = element_text(size = 20)
  )
```


## Installation

The development version of `pitchR` is available from [GitHub](https://github.com/Reed-Math241/pkgGrpq) with:

```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("Reed-Math241/pkgGrpq")
```

## About the Data

Salary data was collected from [Spotrac](https://www.spotrac.com/mlb/rankings/2018/salary/starting-pitcher/) and advanced pitching statistics from Baseball Savant's [Statcast](https://baseballsavant.mlb.com/statcast_search). The full scraping and cleaning process is documented [here](https://github.com/Reed-Math241/pkgGrpq/blob/master/data-raw/DATASET.R).

The `pitchR` package contains one dataset, with 24 variables and 662 observations.

```{r showdata}
library(pitchR)
data('pitchR')
``` 

```{r, echo=FALSE}
devtools::load_all()
```


Here are the first three rows of the data; run `?pitchR` for a more in-depth description of each variable:

```{r example-pitchR}
head(pitchR, 3)
```

Here is the share of non-missing values for each variable. We opted to keep observations with missing values in order to preserve the full salary data.

```{r missing-data, echo=FALSE, fig.width=14, fig.height=8}
pitchR %>% 
  summarise(across(everything(), ~mean(!is.na(.)))) %>% 
  gather() %>% 
  mutate(key = fct_reorder(key, value)) %>% 
  ggplot(aes(key, value, fill = value > .90)) +
  geom_col(
    alpha = 0.3,
    color = NA,
    show.legend = FALSE
  ) +
  scale_fill_manual(name = "",
                    values = c("midnightblue", "cyan4")) +
  geom_text(
    aes(label = scales::percent(value)),
    nudge_y = 0.05,
    size = 8
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_blank(),
    axis.text.y = element_text(size = 19),
    axis.title = element_text(size = 19, margin = margin(10, 10, 16, 10))
  ) +
  labs(
    x = "",
    y = "% of data present"
  ) +
  coord_flip()
```
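
If you want the same information as numbers rather than a plot, or need to drop incomplete rows before modeling, base R is enough. The sketch below uses a small made-up data frame so that it runs without the package installed: the values are invented, and only the column names `name`, `Year`, `salary`, and `woba` are taken from `pitchR`.

```r
# Toy stand-in for pitchR: values are invented for illustration only.
toy <- data.frame(
  name   = c("Pitcher A", "Pitcher B", "Pitcher C", "Pitcher D"),
  Year   = c(2018, 2018, 2019, 2020),
  salary = c(1.2e6, NA, 5.5e5, 7.0e6),
  woba   = c(0.310, 0.295, NA, NA)
)

# Share of non-missing values in each column.
present <- colMeans(!is.na(toy))
present

# Keep only rows with no missing values, e.g. before fitting a model.
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

Swapping `toy` for `pitchR` should give the per-variable shares shown in the plot, assuming the package is loaded.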


## Examples

Because `pitchR` covers three seasons, it lends itself to year-over-year summaries and comparisons. For example:

```{r example, warning=FALSE, message=FALSE}
library(tidyverse)

pitchR %>% 
  count(Year)

pitchR %>% 
  group_by(Year) %>% 
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
```

The package also includes expected statistics, which are derived by comparing the exit velocity and launch angle of batted balls against historical outcomes:

```{r pitcher_woba-1, echo=FALSE, warning=FALSE, fig.width=14, fig.height=8}
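# Average each pitcher's statistics across seasons, skipping missing values
pitcher <- pitchR %>%
  group_by(name) %>%
  summarize(
    ba = mean(ba, na.rm = TRUE),
    xba = mean(xba, na.rm = TRUE),
    salary = mean(salary, na.rm = TRUE)
    )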
ggplot(data = pitcher, aes(x = ba, y = xba)) +
  geom_point(aes(color = salary), size = 5) +
  geom_abline(slope = 1, intercept = 0, alpha = 0.3, linewidth = 3, color = "midnightblue") +
  scale_color_gradient(low = "gray80", high = "turquoise4") +
  labs(
    title = "Batting Average vs Expected Batting Average by Pitcher",
    subtitle = "Batting average is hits divided by at-bats. Expected values are derived by
comparing the exit velocity and launch angle of batted balls against historical outcomes",
    x = "Batting Average",
    y = "Expected Batting Average",
    color = "Salary"
    ) +
  theme_minimal() +
  theme(
    plot.title.position = "plot",
    plot.subtitle = element_text(margin = margin(10, 0, 20, 0), size = 19),
    plot.title = element_text(size = 25, face = "bold"),
    axis.text = element_text(size = 19),
    axis.title = element_text(size = 19, margin = margin(10, 10, 16, 10)),
    legend.key.size = unit(2, "cm"),
    legend.text = element_text(size = 15),
    legend.title = element_text(size = 20)
    )

```
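
One way to put a number on the distance from the diagonal in a plot like this is to compute the gap between the observed and expected statistic for each pitcher. This is only a sketch under stated assumptions: the data frame is made up so the code runs without the package, the `gap` column is introduced here for illustration, and it assumes `ba` and `xba` describe the hitters each pitcher faced.

```r
# Toy stand-in for pitchR: values are invented for illustration only.
toy <- data.frame(
  name = c("Pitcher A", "Pitcher A", "Pitcher B", "Pitcher C"),
  Year = c(2018, 2019, 2018, 2019),
  ba   = c(0.250, 0.230, 0.270, NA),
  xba  = c(0.240, 0.250, 0.260, 0.245)
)

# Observed minus expected batting average for each pitcher-season.
toy$gap <- toy$ba - toy$xba

# Average gap per pitcher. The formula interface drops rows where
# gap is NA, so Pitcher C does not appear in the result.
per_pitcher <- aggregate(gap ~ name, data = toy, FUN = mean)

# Sort from most negative to most positive gap.
ranked <- per_pitcher[order(per_pitcher$gap), ]
ranked
```

Under that assumption, a positive gap means hitters fared better against the pitcher than the quality of contact would suggest, and a negative gap means they fared worse.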


## Query functions

`pitchR` also has a built-in function, `get_salary()`, that takes a year and a team as its inputs and returns a tibble with each pitcher's salary on that team during that year. This differs from the `pitchR` dataset, which only includes starting pitchers.

Because the function relies on web scraping, it only accepts team names written in a specific format: in general, names are all lowercase with spaces replaced by dashes. You can print all 30 accepted team names with the `list_teams()` function.

```{r}
list_teams()
```
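
The general naming rule (lowercase, dashes instead of spaces) can be sketched as a small helper. Note that `to_team_slug()` is a hypothetical function written for this example and is not part of `pitchR`. Since the rule only holds "in general", it is worth comparing the result against the output of `list_teams()` before passing it to `get_salary()`.

```r
# Hypothetical helper (not part of pitchR): convert a display name such as
# "Colorado Rockies" into the lowercase, dash-separated form described above.
to_team_slug <- function(team) {
  slug <- tolower(trimws(team))  # lowercase, strip surrounding whitespace
  gsub("\\s+", "-", slug)        # replace each run of whitespace with one dash
}

to_team_slug("Colorado Rockies")
#> [1] "colorado-rockies"
```

Names containing punctuation may not follow this rule exactly, which is another reason to check against the accepted list.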

Now, we can use `get_salary()` to pull some salary data. Since the output is a tibble, we can easily plot this data:

```{r bars}
get_salary(2018, "colorado-rockies") %>% 
  mutate(name = fct_reorder(name, salary)) %>% 
  ggplot(aes(name, salary)) +
  geom_col() +
  coord_flip() +
  theme_minimal()
```

For more information on using this function, run `?get_salary`.