0.5 release notes
BillPetti committed May 29, 2018
1 parent fc89498 commit b09a159
Showing 1 changed file with 136 additions and 106 deletions.
242 changes: 136 additions & 106 deletions baseballr_Updates/current-release-notes.md
@@ -3,137 +3,167 @@ layout: page
title: baseballr current release notes
tags: rstats, baseballr
---
## May 29, 2018

## April 9, 2018
The latest release of the [`baseballr`](https://billpetti.github.io/baseballr/) package for `R` (0.5) includes a number of enhancements and bug fixes.

The latest release of the [`baseballr`](https://billpetti.github.io/baseballr/) package for `R` (0.4.1) includes a number of enhancements and bug fixes.
## New Functions

There are now two functions that allow users to scrape player game logs from FanGraphs:
`run_expectancy_code()`

- `batter_game_logs_fg`
- `pitcher_game_logs_fg`
This function formats Baseball Savant data so that users can generate the run expectancy for different base-out or count-base-out states. It also appends to the data frame the new variables needed for generating linear weights (see the `linear_weights_savant()` function below). The only argument is a data frame downloaded from Baseball Savant.

Both functions have two arguments: `playerid` and `year`. Both will return detailed game logs from FanGraphs for the selected season.
Columns created and appended to Baseball Savant data:

- `final_pitch_game`: whether a pitch was the final one thrown in a game
- `final_pitch_inning`: whether a pitch is the final one thrown in an inning
- `final_pitch_at_bat`: whether a pitch is the final one thrown in an at bat
- `runs_scored_on_pitch`: how many runs scored as a result of the pitch
- `bat_score_start_inning`: the score for the batting team at the beginning of the inning
- `bat_score_end_inning`: the score for the batting team at the end of the inning
- `bat_score_after`: the score for the batting team after the pitch is thrown
- `cum_runs_in_inning`: how many cumulative runs have been scored from the beginning of the inning through the pitch
- `runs_to_end_inning`: how many runs were scored as a result of the pitch through the end of the inning
- `base_out_state` or `count_base_out_state`: the specific combination of base-outs or count-base-outs when the pitch was thrown
- `avg_re`: the average run expectancy of that base-out or count-base-out state
- `next_avg_re`: the average run expectancy of the base-out or count-base-out state that results from the pitch
- `change_re`: the change in run expectancy as a result of the pitch
- `re24`: the total change in run expectancy through the end of the inning resulting from the pitch based on the change in base-out or count-base-out state plus the number of runs scored as a result of the pitch/at bat

Example:

```r
> batter_game_logs_fg(playerid = 10155, year = 2018)
Date Team Opp BO Pos PA H X2B X3B HR R RBI SB CS BB_perc K_perc ISO
1 2018-04-04 LAA CLE 2 CF 5 0 0 0 0 0 0 0 0 0.40 0.200 .000
2 2018-04-03 LAA CLE 2 CF 5 1 0 0 1 1 1 0 0 0.20 0.000 .750
3 2018-04-02 LAA CLE 2 CF 4 0 0 0 0 0 0 0 0 0.25 0.250 .000
4 2018-04-01 LAA @OAK 2 CF 5 2 1 0 0 1 1 0 0 0.00 0.000 .200
5 2018-03-31 LAA @OAK 2 CF 5 3 2 0 0 2 2 1 0 0.00 0.200 .400
6 2018-03-30 LAA @OAK 2 CF 4 1 0 0 1 2 1 0 0 0.00 0.000 .750
7 2018-03-29 LAA @OAK 2 CF 6 0 0 0 0 0 0 0 0 0.00 0.167 .000
BABIP AVG OBP SLG wOBA wRC_plus
1 .000 .000 .400 .000 .276 78
2 .000 .250 .400 1.000 .559 280
3 .000 .000 .250 .000 .172 5
4 .400 .400 .400 .600 .432 190
5 .750 .600 .600 1.000 .686 371
6 .000 .250 .250 1.000 .526 257
7 .000 .000 .000 .000 .000 -100
> x2016_statcast_re <- run_expectancy_code(x2016_statcast)

> sample_n(x2016_statcast_re, 10) %>%
select(final_pitch_inning:re24) %>%
glimpse()

Observations: 10
Variables: 11
$ final_pitch_inning <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0
$ bat_score_start_inning <dbl> 1, 0, 5, 0, 3, 2, 1, 0, 0, 0
$ bat_score_end_inning <dbl> 2, 0, 5, 1, 3, 2, 5, 0, 0, 2
$ cum_runs_in_inning <dbl> 1, 0, 0, 0, 0, 0, 2, 0, 0, 1
$ runs_to_end_inning <dbl> 0, 0, 0, 1, 0, 0, 2, 0, 0, 1
$ base_out_state <chr> "2 outs, 1b _ _", "0 outs, _ _ _", "0 outs...
$ avg_re <dbl> 0.2149885, 0.5057877, 0.5057877, 0.5057877, 0.5...
$ next_base_out_state <chr> "2 outs, 1b 2b _", "1 outs, _ _ _", "1 out...
$ next_avg_re <dbl> 0.4063525, 0.2718802, 0.2718802, 0.8629357, 0.2...
$ change_re <dbl> 0.1913640, -0.2339075, -0.2339075, 0.3571479, -...
$ re24 <dbl> 0.1913640, -0.2339075, -0.2339075, 0.3571479, -...
```
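
The relationship between `avg_re`, `next_avg_re`, `change_re`, and `re24` described in the column list can be sketched on a toy data frame. This is a minimal illustration, not the package's implementation: the `pitches` data frame and its values are made up, and the arithmetic simply follows the column descriptions (change in state plus runs scored on the pitch).

```r
# Toy rows standing in for the output of run_expectancy_code();
# the data frame and its values are hypothetical.
pitches <- data.frame(
  avg_re = c(0.215, 0.506, 0.863),
  next_avg_re = c(0.406, 0.272, 0.506),
  runs_scored_on_pitch = c(0, 0, 1)
)

# Change in run expectancy between the state before and after the pitch
pitches$change_re <- pitches$next_avg_re - pitches$avg_re

# RE24-style value: change in state plus any runs that scored on the pitch
pitches$re24 <- pitches$change_re + pitches$runs_scored_on_pitch

print(pitches)
```

When no runs score, `re24` equals `change_re`, which matches the first rows of the sample output above.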

`viz_gb_on_period` was contributed by [Daniel H](https://github.com/darh78) and allows a user to generate a time series of standings for a division and automatically visualize the data in an interactive chart.
`run_expectancy_table()`

This function works with the `run_expectancy_code()` function and does the work of generating the run expectancy tables, which are automatically exported into the Global Environment.

Example:

```r
> viz_gb_on_period("2018-03-29", "2018-04-05", "NL East")
|++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed = 16s
# A tibble: 10 x 7
League Date Team W L WLpct GB
<chr> <date> <chr> <int> <int> <dbl> <dbl>
1 NL East 2018-03-29 NYM 1 0 1.00 0.
2 NL East 2018-03-29 ATL 1 0 1.00 0.
3 NL East 2018-03-29 WSN 0 0 0. 0.500
4 NL East 2018-03-29 PHI 0 1 0. 1.00
5 NL East 2018-03-29 MIA 0 1 0. 1.00
6 NL East 2018-04-05 NYM 5 1 0.833 0.
7 NL East 2018-04-05 ATL 4 2 0.667 1.00
8 NL East 2018-04-05 WSN 4 3 0.571 1.50
9 NL East 2018-04-05 PHI 2 4 0.333 3.00
10 NL East 2018-04-05 MIA 2 5 0.286 3.50
> x2016_statcast_re %>%
run_expectancy_table() %>%
print(n=Inf)

base_out_state avg_re
<chr> <dbl>
1 0 outs, 1b 2b 3b 2.13
2 0 outs, _ 2b 3b 1.95
3 0 outs, 1b _ 3b 1.76
4 1 outs, 1b 2b 3b 1.55
5 0 outs, 1b 2b _ 1.42
6 1 outs, _ 2b 3b 1.36
7 0 outs, _ _ 3b 1.36
8 1 outs, 1b _ 3b 1.18
9 0 outs, _ 2b _ 1.14
10 1 outs, _ _ 3b 0.951
11 1 outs, 1b 2b _ 0.906
12 0 outs, 1b _ _ 0.863
13 2 outs, 1b 2b 3b 0.689
14 1 outs, _ 2b _ 0.669
15 2 outs, _ 2b 3b 0.525
16 1 outs, 1b _ _ 0.520
17 0 outs, _ _ _ 0.506
18 2 outs, 1b _ 3b 0.456
19 2 outs, 1b 2b _ 0.406
20 2 outs, _ _ 3b 0.366
21 2 outs, _ 2b _ 0.299
22 1 outs, _ _ _ 0.272
23 2 outs, 1b _ _ 0.215
24 2 outs, _ _ _ 0.106
```
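
A table like the one above is conventionally built by averaging the runs scored through the end of the inning for each state. The following is a rough sketch of that idea in base R, assuming a hypothetical `toy` data frame with the `base_out_state` and `runs_to_end_inning` columns; the package itself may filter or aggregate the data differently.

```r
# Hypothetical miniature data set: one row per observed state
toy <- data.frame(
  base_out_state = c("0 outs, _ _ _", "0 outs, _ _ _",
                     "2 outs, 1b _ _", "2 outs, 1b _ _"),
  runs_to_end_inning = c(0, 2, 0, 1),
  stringsAsFactors = FALSE
)

# Average the runs scored through the end of the inning for each state
re_table <- aggregate(runs_to_end_inning ~ base_out_state, data = toy, FUN = mean)
names(re_table)[2] <- "avg_re"

# Sort from highest to lowest run expectancy, as in the printed table
re_table <- re_table[order(-re_table$avg_re), ]
print(re_table)
```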

![alt text](https://github.com/BillPetti/baseballr/blob/gh-pages/baseballr_Updates/vz_gb_chart_ex.png "vz_gb ex")
`linear_weights_savant()`

[Ben Dilday](https://github.com/bdilday) combined the various `scrape_statcast_savant` functions I previously released into a single function. The single function can pull all data over a given date range for all pitchers or batters or just for specific pitchers or batters.
This function works in tandem with `run_expectancy_code()` to generate linear weights for offensive events after the Baseball Savant data has been properly formatted. Currently, the function will return linear weights above average and linear weights above outs. It does not apply any scaling to align with league wOBA. Users can do that themselves if they like, or it may be added to a future version of the function.
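
The two returned measures appear to differ only by the value of an out: in the example output for this function, each "above outs" weight equals the "above average" weight minus the weight for `outs`. A small sketch of that relationship, using the rounded values from that output (the `lw` data frame is constructed by hand here, not produced by the package):

```r
# Linear weights above average, copied from the example output in this section
lw <- data.frame(
  events = c("home_run", "triple", "double", "single",
             "hit_by_pitch", "walk", "outs"),
  linear_weights_above_average = c(1.38, 1.00, 0.73, 0.44, 0.32, 0.29, -0.25),
  stringsAsFactors = FALSE
)

# Value of an out relative to average
out_value <- lw$linear_weights_above_average[lw$events == "outs"]

# Re-express every event relative to an out instead of relative to average
lw$linear_weights_above_outs <- lw$linear_weights_above_average - out_value
print(lw)
```

Scaling these values to league wOBA would be a further step that the function does not perform.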

Example:

```r
>head(scrape_statcast_savant(start_date = "2016-04-06", end_date = "2016-04-15", playerid = 592789, player_type='pitcher'))
pitch_type game_date release_speed release_pos_x release_pos_z
1 FF 2016-04-12 97.3 -0.6733 6.4372
2 FF 2016-04-12 97.8 -0.6366 6.4466
3 FF 2016-04-12 97.6 -0.4936 6.4440
4 FF 2016-04-12 97.0 -0.6884 6.5753
5 SL 2016-04-12 91.6 -0.7873 6.4002
6 CH 2016-04-12 88.9 -1.0913 6.2130
player_name batter pitcher events description spin_dir
1 Noah Syndergaard 400085 592789 single hit_into_play NA
2 Noah Syndergaard 400085 592789 <NA> called_strike NA
3 Noah Syndergaard 425772 592789 field_out hit_into_play NA
4 Noah Syndergaard 425772 592789 <NA> called_strike NA
5 Noah Syndergaard 588751 592789 field_out hit_into_play NA
6 Noah Syndergaard 518618 592789 double hit_into_play_no_out NA
```

```r
>head(scrape_statcast_savant(start_date = "2016-04-06", end_date = "2016-04-06"))
pitch_type game_date release_speed release_pos_x release_pos_z player_name batter pitcher
1 FT 2016-04-06 91.2 -1.9089 6.4077 Jake Marisnick 545350 467100
2 CU 2016-04-06 77.7 -1.7753 6.7376 Carlos Correa 621043 467100
3 CU 2016-04-06 80.3 -1.5339 6.6463 Carlos Correa 621043 467100
4 SL 2016-04-06 84.7 -1.7689 6.4903 Carlos Correa 621043 467100
5 FT 2016-04-06 90.8 -1.8843 6.4235 Carlos Correa 621043 467100
6 FT 2016-04-06 90.2 -1.7467 6.5141 George Springer 543807 467100
events description spin_dir spin_rate_deprecated break_angle_deprecated
1 field_out hit_into_play NA NA NA
2 strikeout swinging_strike_blocked NA NA NA
3 <NA> ball NA NA NA
4 <NA> called_strike NA NA NA
5 <NA> called_strike NA NA NA
6 walk ball NA NA NA

> x2016_statcast_re %>%
linear_weights_savant() %>%
print(n=Inf)

A tibble: 7 x 3
events linear_weights_above_average linear_weights_above_outs
<chr> <dbl> <dbl>
1 home_run 1.38 1.63
2 triple 1.00 1.25
3 double 0.730 0.980
4 single 0.440 0.690
5 hit_by_pitch 0.320 0.570
6 walk 0.290 0.540
7 outs -0.250 0.
```

Finally, I've added a function for generating spray charts based on my [Interactive Spray Chart Tool](http://billpetti.shinyapps.io/shiny_spraychart/).
I used Baseball Savant data from 2010-2015 and compared the linear weights generated by `baseballr` to those generated by Tom Tango using Retrosheet data. `baseballr`'s weights are generally a little lower than Tango's. That could be due to a number of things, such as the data source or the code, but the values appear reasonable enough to be reliable:

`ggspraychart` can generate either a typical spray chart or a density chart for a given hitter. The function takes a data frame with hit coordinates and allows users to customize fill colors and values and the transparency of points. Users can also adjust the bin value when generating density plots.
| base_out_state | baseballr_2010_2015 | tango_2010_2015 | diff | %_diff |
|--------------------|---------------------|-----------------|-------|--------|
| 0 outs, 1b 2b 3b | 2.27 | 2.29 | -0.02 | -1% |
| 0 outs, _ 2b 3b | 1.96 | 1.96 | 0 | 0% |
| 0 outs, 1b _ 3b | 1.76 | 1.78 | -0.03 | -1% |
| 1 outs, 1b 2b 3b | 1.51 | 1.54 | -0.03 | -2% |
| 0 outs, 1b 2b _ | 1.42 | 1.44 | -0.02 | -1% |
| 0 outs, _ _ 3b | 1.38 | 1.38 | 0 | 0% |
| 1 outs, _ 2b 3b | 1.35 | 1.35 | 0 | 0% |
| 1 outs, 1b _ 3b | 1.1 | 1.13 | -0.03 | -2% |
| 0 outs, _ 2b _ | 1.09 | 1.1 | -0.01 | -1% |
| 1 outs, _ _ 3b | 0.93 | 0.95 | -0.02 | -2% |
| 1 outs, 1b 2b _ | 0.86 | 0.88 | -0.02 | -3% |
| 0 outs, 1b _ _ | 0.84 | 0.86 | -0.02 | -2% |
| 2 outs, 1b 2b 3b | 0.71 | 0.75 | -0.04 | -5% |
| 1 outs, _ 2b _ | 0.65 | 0.66 | -0.01 | -2% |
| 2 outs, _ 2b 3b | 0.54 | 0.58 | -0.04 | -7% |
| 1 outs, 1b _ _ | 0.5 | 0.51 | -0.01 | -2% |
| 0 outs, _ _ _ | 0.48 | 0.48 | 0 | -1% |
| 2 outs, 1b _ 3b | 0.45 | 0.48 | -0.03 | -7% |
| 2 outs, 1b 2b _ | 0.41 | 0.43 | -0.02 | -4% |
| 2 outs, _ _ 3b | 0.33 | 0.35 | -0.02 | -6% |
| 2 outs, _ 2b _ | 0.31 | 0.32 | -0.01 | -3% |
| 1 outs, _ _ _ | 0.25 | 0.25 | 0 | -1% |
| 2 outs, 1b _ _ | 0.21 | 0.22 | -0.01 | -6% |
| 2 outs, _ _ _ | 0.1 | 0.1 | 0 | -2% |

Keep in mind that the `hc_y` coordinate provided by baseballsavant needs to be inverted in order to properly plot the data. (I typically create a variable, `hc_y_rotated`, by multiplying `hc_y` by -1 and use that for plotting.)
We also had some great contributions by others that I've added into this release:

Here are point and density examples using data for Jose Altuve:
`label_statcast_imputed_data()`

```r
ggspraychart(data, point_alpha = .6, fill_legend_title = "Hit Type", fill_value = "hit_type",
fill_palette = c("1"="#A2C8EC", "2"="#006BA4", "3"="#FF940E",
"Out"="#595959", "4"="#C85200")) +
facet_wrap(~game_year, nrow = 2) +
ggtitle("\nJose Altuve") +
labs(subtitle = "Spray Charts Since 2013\n")
```
![alt text](https://github.com/BillPetti/baseballr/blob/gh-pages/baseballr_Updates/altuve_facet_ex.png "facet ex")
[Ben Dilday](https://github.com/bdilday) again contributed a cool experimental function meant to tag batted ball cases where MLBAM may have used significant imputation to generate some of the Statcast values, i.e. `launch_speed` and `launch_angle`. You can read more about Ben's function [here](https://github.com/BillPetti/baseballr/pull/71).

```r
ggspraychart(data, point_alpha = .2, density = TRUE, bin_size = 30) +
facet_wrap(~game_year, nrow = 2) +
ggtitle("\nJose Altuve") +
labs(subtitle = "Spray Charts Since 2013\n")
```
![alt text](https://github.com/BillPetti/baseballr/blob/gh-pages/baseballr_Updates/altuve_facet_density.png "density ex")

The function is also written so that it can be combined with `gganimate` to create animated plots:
`fg_park()`

```r
require(gganimate)
[Sam Boysel](https://github.com/sboysel) updated the park factors function so that it now includes the new columns added by FanGraphs (5-year, 3-year, 1-year park factors) and ensures the column names are correct.

years <- c(2013, 2014, 2015, 2016, 2017)
## Upgrades

p <- ggspraychart(data, density = TRUE, point_alpha = .2, bin_size = 30, frame = "game_year") +
ggtitle("\n\n Jose Altuve's Evolution by Year:") +
labs(caption = "@BillPetti\nData source: baseballsavant.com\nBuilt with the baseballr package\n") +
theme(plot.caption = element_text(face = "bold", size = 14))
`fg_bat_leaders()`

gganimate(p, ani.width=800, ani.height=800)
```
- `playerid` is now included in the returned data.
- Dozens of additional variables are also returned, including aggregate data from Pitch Info as well as contact type.

## Bug Fixes

![alt text](https://github.com/BillPetti/baseballr/blob/gh-pages/baseballr_Updates/Altuve_evolution.gif?raw=true "gif e example")

Be sure whatever variable you assign to the `frame` argument is a factor and the levels are in the desired order for the animation.
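
One way to prepare such a variable is to build the factor with explicitly sorted levels. This is a small base R sketch on a made-up `spray` data frame with a `game_year` column; the names are illustrative only.

```r
# Hypothetical data with seasons in arbitrary row order
spray <- data.frame(game_year = c(2015, 2013, 2017, 2014, 2016, 2013))

# Convert to a factor whose levels run in chronological order,
# so the animation frames play from earliest to latest season
spray$game_year <- factor(spray$game_year,
                          levels = sort(unique(spray$game_year)))

levels(spray$game_year)
```
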
`process_statcast_payload()`
- `hc_x` and `hc_y` are now converted to numeric.
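
For data pulled with an earlier version, a similar conversion can be applied by hand. The sketch below assumes the coordinates arrived as character strings with a `"null"` placeholder for missing values; that placeholder and the `payload` data frame are assumptions for illustration, not a description of the package's internals.

```r
# Hypothetical raw payload where coordinates were read in as text
payload <- data.frame(
  hc_x = c("125.3", "98.1", "null"),
  hc_y = c("150.2", "null", "88.7"),
  stringsAsFactors = FALSE
)

# Coerce both coordinate columns to numeric; non-numeric strings become NA
for (col in c("hc_x", "hc_y")) {
  payload[[col]] <- suppressWarnings(as.numeric(payload[[col]]))
}

str(payload)
```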
