
The goal of pitchR is to provide an accessible dataset with advanced pitching statistics and salary data for individual starting pitchers from the 2018-2020 regular seasons. As a robust and tidy dataset, pitchR provides a great resource for modeling with baseball’s most novel advanced statistics ⚾.


The development version of pitchR is available from GitHub with:

# install.packages("devtools")
devtools::install_github("Reed-Math241/pkgGrpq")

About the Data

Salary data was collected from Spotrac and advanced pitching statistics from Baseball Savant’s Statcast. The full scraping and cleaning process is documented here.

The pitchR package contains one dataset, with 24 variables and 662 observations.


Here is a simplified version of the data; run ?pitchR for a more in-depth description:

head(pitchR, 3)
#> # A tibble: 3 x 24
#>   name  salary team   Year pitches player_id    ba   iso babip   slg  woba xwoba
#>   <chr>  <dbl> <chr> <dbl>   <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 clay… 3.56e7 LAD    2018    2364    477132 0.227 0.139 0.276 0.366 0.272 0.285
#> 2 rich… 1.67e7 LAD    2018    2104    448179 0.219 0.181 0.272 0.4   0.297 0.309
#> 3 hyun… 7.83e6 LAD    2018    1238    547943 0.221 0.14  0.282 0.362 0.268 0.278
#> # … with 12 more variables: xba <dbl>, hits <dbl>, abs <dbl>,
#> #   launch_speed <dbl>, launch_angle <dbl>, spin_rate <dbl>, velocity <dbl>,
#> #   effective_speed <dbl>, whiffs <dbl>, swings <dbl>, takes <dbl>,
#> #   release_extension <dbl>

Here is a breakdown of how much missing data we have by variable. We opted to keep observations with missing values in order to keep a full version of the salary data.
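A per-variable count of missing values can be sketched with base R. The data frame below is a toy stand-in with made-up values (an assumption for illustration, not rows from pitchR); with the package loaded, the same `colSums(is.na(...))` call should work on `pitchR` itself.

```r
# Toy stand-in for pitchR: the values are invented for illustration only
toy <- data.frame(
  name      = c("pitcher-a", "pitcher-b", "pitcher-c"),
  salary    = c(1e6, 2e6, 3e6),
  spin_rate = c(2300, NA, 2450),
  velocity  = c(NA, NA, 93.1)
)

# Count NA values in each column
missing_by_var <- colSums(is.na(toy))

# Keep only the variables that actually have missing values
missing_by_var[missing_by_var > 0]
```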


Because pitchR covers three seasons, it lends itself to year-over-year summaries and comparisons. For example:


library(dplyr)

pitchR %>% 
  count(Year)
#> # A tibble: 3 x 2
#>    Year     n
#> * <dbl> <int>
#> 1  2018   251
#> 2  2019   211
#> 3  2020   200

pitchR %>% 
  group_by(Year) %>% 
  summarize(across(where(is.numeric), ~ mean(.x, na.rm = TRUE)))
#> # A tibble: 3 x 22
#>    Year salary pitches player_id    ba   iso babip   slg  woba xwoba   xba  hits
#> * <dbl>  <dbl>   <dbl>     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  2018 4.50e6   1716.   563538. 0.255 0.173 0.294 0.428 0.327 0.333 0.256  98.6
#> 2  2019 5.15e6   1906.   576284. 0.259 0.192 0.301 0.451 0.327 0.332 0.259 112. 
#> 3  2020 5.76e6    759.   595009. 0.242 0.173 0.284 0.416 0.310 0.312 0.249  41.7
#> # … with 10 more variables: abs <dbl>, launch_speed <dbl>, launch_angle <dbl>,
#> #   spin_rate <dbl>, velocity <dbl>, effective_speed <dbl>, whiffs <dbl>,
#> #   swings <dbl>, takes <dbl>, release_extension <dbl>

Another exciting feature of the package is its inclusion of expected statistics such as xba and xwoba.
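One way to use the expected statistics is to compare them with their observed counterparts. The sketch below copies the woba and xwoba values of the three rows printed by `head(pitchR, 3)` earlier in this README into a small data frame; the `row` and `woba_minus_xwoba` columns are names introduced here for illustration.

```r
# woba and xwoba for the three rows shown in the head(pitchR, 3) output
obs <- data.frame(
  row   = 1:3,
  woba  = c(0.272, 0.297, 0.268),
  xwoba = c(0.285, 0.309, 0.278)
)

# Observed minus expected: a negative gap means the pitcher allowed a lower
# wOBA than the expected statistic suggested
obs$woba_minus_xwoba <- obs$woba - obs$xwoba
obs
```

With the full dataset, the same subtraction could be done inside `mutate()` and summarized by `Year`.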

Query functions

pitchR also has a built-in function called get_salary() that takes a year and a team as its inputs and outputs a tibble of each pitcher's salary on that team during that year. This differs from the pitchR dataset, which includes only starting pitchers.

Because get_salary() relies on web scraping, it only accepts team names written in a specific format: all lowercase, with spaces replaced by dashes. You can print the list of all 30 accepted team names with the list_teams() function.

list_teams()
#>  [1] "los-angeles-dodgers"   "new-york-yankees"      "philadelphia-phillies"
#>  [4] "houston-astros"        "los-angeles-angels"    "boston-red-sox"       
#>  [7] "new-york-mets"         "washington-nationals"  "san-diego-padres"     
#> [10] "st-louis-cardinals"    "chicago-cubs"          "san-francisco-giants" 
#> [13] "toronto-blue-jays"     "atlanta-braves"        "chicago-white-sox"    
#> [16] "minnesota-twins"       "cincinnati-reds"       "colorado-rockies"     
#> [19] "kansas-city-royals"    "arizona-diamondbacks"  "texas-rangers"        
#> [22] "milwaukee-brewers"     "detroit-tigers"        "seattle-mariners"     
#> [25] "oakland-athletics"     "tampa-bay-rays"        "miami-marlins"        
#> [28] "baltimore-orioles"     "pittsburgh-pirates"    "cleveland-indians"
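The naming convention (lowercase, dashes in place of spaces) can be sketched as a small helper. `to_team_slug()` is a hypothetical function, not part of pitchR, and dropping punctuation is an assumption inferred from "st-louis-cardinals" in the list above. Checking the result against `list_teams()` is still the safer route.

```r
# Hypothetical helper (not part of pitchR): convert a display name such as
# "Colorado Rockies" into the lowercase, dash-separated form shown above
to_team_slug <- function(team) {
  slug <- tolower(trimws(team))
  # Assumption: punctuation is dropped, e.g. the "." in "St. Louis"
  slug <- gsub("[^a-z0-9 ]", "", slug)
  # Replace each run of spaces with a single dash
  gsub(" +", "-", slug)
}

to_team_slug("Colorado Rockies")
to_team_slug("St. Louis Cardinals")
```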

Now, we can use get_salary() to pull some salary data. Since the output is a tibble, we can easily plot this data:

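library(dplyr)
library(forcats)
library(ggplot2)

# Order pitchers by salary, then draw horizontal bars
get_salary(2018, "colorado-rockies") %>% 
  mutate(name = fct_reorder(name, salary)) %>% 
  ggplot(aes(name, salary)) +
  geom_col() +
  coord_flip()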

For more information on using this function, run ?get_salary.


