pitchR

pitchR provides an accessible, tidy dataset of advanced pitching statistics and salary data for individual starting pitchers from the 2018-2020 regular seasons. It is intended as a resource for modeling with baseball’s newer advanced statistics ⚾.

Installation

The development version of pitchR is available from GitHub with:

# install.packages("devtools")
devtools::install_github("Reed-Math241/pkgGrpq")

About the Data

Salary data was collected from Spotrac and advanced pitching statistics from Baseball Savant’s Statcast. The full scraping and cleaning process is documented here.

The pitchR package contains one dataset, with 24 variables and 662 observations.

library(pitchR)
data('pitchR')

Here are the first three rows of the data; run ?pitchR for a more in-depth description of each variable:

head(pitchR, 3)
#> # A tibble: 3 x 24
#>   name  salary team   Year pitches player_id    ba   iso babip   slg  woba xwoba
#>   <chr>  <dbl> <chr> <dbl>   <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 clay… 3.56e7 LAD    2018    2364    477132 0.227 0.139 0.276 0.366 0.272 0.285
#> 2 rich… 1.67e7 LAD    2018    2104    448179 0.219 0.181 0.272 0.4   0.297 0.309
#> 3 hyun… 7.83e6 LAD    2018    1238    547943 0.221 0.14  0.282 0.362 0.268 0.278
#> # … with 12 more variables: xba <dbl>, hits <dbl>, abs <dbl>,
#> #   launch_speed <dbl>, launch_angle <dbl>, spin_rate <dbl>, velocity <dbl>,
#> #   effective_speed <dbl>, whiffs <dbl>, swings <dbl>, takes <dbl>,
#> #   release_extension <dbl>

Here is a breakdown of how much missing data we have by variable. We opted to keep observations with missing values in order to keep a full version of the salary data.
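
The same kind of breakdown can be recomputed from the data. The following is a minimal sketch, assuming only that the input is a data frame; missing_by_variable() is a hypothetical helper written for this example and is not part of pitchR.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of pitchR): count missing values per column.
missing_by_variable <- function(df) {
  df %>%
    # One column per variable, holding its number of NA values
    summarize(across(everything(), ~ sum(is.na(.x)))) %>%
    # Reshape to one row per variable
    pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
    # Share of rows that are missing, then sort with the largest counts first
    mutate(prop_missing = n_missing / nrow(df)) %>%
    arrange(desc(n_missing))
}
```

With the package loaded, missing_by_variable(pitchR) should return one row per variable, sorted so the variables with the most missing values come first.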

Examples

Because pitchR covers three seasons, it lends itself to year-over-year summaries and comparisons. For example:

library(tidyverse)

pitchR %>% 
  count(Year)
#> # A tibble: 3 x 2
#>    Year     n
#> * <dbl> <int>
#> 1  2018   251
#> 2  2019   211
#> 3  2020   200

pitchR %>% 
  group_by(Year) %>% 
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
#> # A tibble: 3 x 22
#>    Year salary pitches player_id    ba   iso babip   slg  woba xwoba   xba  hits
#> * <dbl>  <dbl>   <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  2018 4.50e6   1716.   563538. 0.255 0.173 0.294 0.428 0.327 0.333 0.256  98.6
#> 2  2019 5.15e6   1906.   576284. 0.259 0.192 0.301 0.451 0.327 0.332 0.259 112. 
#> 3  2020 5.76e6    759.   595009. 0.242 0.173 0.284 0.416 0.310 0.312 0.249  41.7
#> # … with 10 more variables: abs <dbl>, launch_speed <dbl>, launch_angle <dbl>,
#> #   spin_rate <dbl>, velocity <dbl>, effective_speed <dbl>, whiffs <dbl>,
#> #   swings <dbl>, takes <dbl>, release_extension <dbl>

Another exciting feature of the package is the inclusion of expected statistics (xba and xwoba), which can be compared with their observed counterparts (ba and woba).
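
One way to use the expected statistics is to look at the gap between what batters actually did against a pitcher and what was expected. The following is a rough sketch, assuming the column names woba, xwoba, ba, and xba shown in the preview above; add_expected_gaps() is a hypothetical helper and not part of pitchR.

```r
library(dplyr)

# Hypothetical helper (not part of pitchR): gap between observed and expected
# statistics. Assumes columns named woba, xwoba, ba, and xba.
add_expected_gaps <- function(df) {
  df %>%
    mutate(
      # Positive values: batters did better against the pitcher than expected
      woba_gap = woba - xwoba,
      ba_gap   = ba - xba
    )
}
```

With the package loaded, add_expected_gaps(pitchR) adds the two gap columns to the full dataset. Sorting by woba_gap is one possible way to look for pitchers whose results diverged most from expectation, though how to interpret such gaps is a modeling choice.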

Query functions

pitchR also has a built-in function called get_salary() that takes a year and a team as its inputs and returns a tibble with the salary of each pitcher on that team during that year. This differs from the pitchR dataset, which includes only starting pitchers.

Because get_salary() relies on web scraping, it only accepts team names written in a specific format. In general, the names are all lowercase, with spaces replaced by dashes. You can print all 30 accepted team names with the list_teams() function.

list_teams()
#>  [1] "los-angeles-dodgers"   "new-york-yankees"      "philadelphia-phillies"
#>  [4] "houston-astros"        "los-angeles-angels"    "boston-red-sox"       
#>  [7] "new-york-mets"         "washington-nationals"  "san-diego-padres"     
#> [10] "st-louis-cardinals"    "chicago-cubs"          "san-francisco-giants" 
#> [13] "toronto-blue-jays"     "atlanta-braves"        "chicago-white-sox"    
#> [16] "minnesota-twins"       "cincinnati-reds"       "colorado-rockies"     
#> [19] "kansas-city-royals"    "arizona-diamondbacks"  "texas-rangers"        
#> [22] "milwaukee-brewers"     "detroit-tigers"        "seattle-mariners"     
#> [25] "oakland-athletics"     "tampa-bay-rays"        "miami-marlins"        
#> [28] "baltimore-orioles"     "pittsburgh-pirates"    "cleveland-indians"
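
The naming pattern in this list (lowercase, dashes for spaces, no punctuation, as in "st-louis-cardinals") can be approximated in base R. This is only a sketch: to_team_slug() is a hypothetical helper, not part of pitchR, and it guesses at the pattern rather than reproducing whatever the package does internally.

```r
# Hypothetical helper (not part of pitchR): convert a display name such as
# "St. Louis Cardinals" to the lowercase, dash-separated form listed above.
to_team_slug <- function(team) {
  slug <- tolower(trimws(team))          # lowercase, strip outer whitespace
  slug <- gsub("[^a-z0-9 -]", "", slug)  # drop punctuation such as the "." in "St."
  gsub(" +", "-", slug)                  # turn each run of spaces into one dash
}
```

Since the helper only imitates the pattern, checking its result against list_teams() with %in% before calling get_salary() is a sensible safeguard.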

Now, we can use get_salary() to pull some salary data. Since the output is a tibble, we can easily plot this data:

get_salary(2018, "colorado-rockies") %>% 
  mutate(name = fct_reorder(name, salary)) %>% 
  ggplot(aes(name, salary)) +
  geom_col() +
  coord_flip() +
  theme_minimal()

For more information on using this function, run ?get_salary.
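
Because get_salary() returns one team and one year at a time, comparing seasons means calling it once per year and stacking the results. The following is a sketch under stated assumptions: salaries_by_season() is a hypothetical helper, the season column is added here for illustration, and get_salary() is assumed to take the year and team positionally, as in the call above.

```r
library(dplyr)

# Hypothetical helper (not part of pitchR): stack get_salary() results for
# several seasons. `fetch` defaults to pitchR::get_salary and is assumed to
# accept (year, team) positionally.
salaries_by_season <- function(team, years, fetch = pitchR::get_salary) {
  years %>%
    lapply(function(yr) {
      # Tag each season's rows; `season` is an added column, not from pitchR
      fetch(yr, team) %>% mutate(season = yr)
    }) %>%
    bind_rows()
}
```

A call such as salaries_by_season("colorado-rockies", 2018:2020) would then return one tibble covering all three seasons, assuming get_salary() supports each of those years. Every year triggers a separate scrape, so keeping the range short is considerate to the source site.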
