# Hockey Analytics with the Stattleship API

__@BrockTibert__    

__February 2016__

<hr/>

# Outline

-  Introduction and Setup
-  Getting Started with `R`
-  Exploring the API
-  Game logs
-  The Stats endpoint
-  Scoring patterns of tonight's matchup  
-  Clustering Allstar Skaters  
-  Scoring Networks


<hr/>

# Setup

Before we get started, here are a few helpful resources:


### _Get your API token_

To use the API, you need a token.  Get yours at [www.stattleship.com](https://stattleship.com/)


### _Explore what is possible_

1.  Take a look at [the playbook](http://playbook.stattleship.com/)

2.  You can also look at other code samples at http://developers.stattleship.com/#introduction

### _GitHub Development_

Check out the GitHub repo at https://github.com/stattleship

### API Access

It's a `REST` API, so getting data is fairly straightforward from most programming languages, or from the command line with `curl`.  But to make it easier, we have been working on a few wrappers in common languages.

-  [Ruby](https://github.com/stattleship/stattleship-ruby)  
-  [R](https://github.com/stattleship/stattleship-r)  
-  [Python](https://github.com/stattleship/stattleship-python)  


<hr/>

# Getting Started with R

Some requirements:

1.  `R`, which can be downloaded at https://cran.r-project.org/
2.  You need your API token, which you can get as described above.  

I am using a Jupyter notebook for this talk, but most of the time you would probably want to use __RStudio__ as your IDE.  I highly recommend downloading it at https://www.rstudio.com/.



## Install the R Package for Stattleship

To install the R wrapper for the Stattleship API, you will need the devtools package.  From the R REPL, type:

```
install.packages("devtools")
```

It may ask you for a mirror if you are not using `RStudio`, but simply select one that is closest to you.

Once `devtools` is good to go, you can install our R package by:

```
devtools::install_github("stattleship/stattleship-r")
```

You should be good to go.  If you have trouble with this, let us know, and we should be able to help you out.

Below, I will use a few other packages as well.  They are pretty helpful to have installed for everyday R coding.

```
install.packages("dplyr")
install.packages("stringr")
install.packages("lubridate")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("RColorBrewer")
```

<hr/>

# Explore the API

_Let's get started ..._

In [None]:
## factors are the devil
options(stringsAsFactors = FALSE)

## load some packages
library(stattleshipR)
suppressPackageStartupMessages(library("dplyr"))
library("lubridate")
library("stringr")
library("ggplot2")
library("tidyr")
library("RColorBrewer")



In [None]:
## to access the API, you need to set your token
set_token(Sys.getenv("STATTLE_TOKEN"))

Above, I set my token with an `environment variable`, but you could have set the token just as easily by:

```
set_token("YourTokenGoesHere")
```

I just prefer to set my API tokens as environment variables on my system so I don't have to type them out each time.  

For more help on R environment variables, poke around for help on `.Renviron` files.  A `.Renviron` file is simply a text file of `NAME=value` lines that R reads in at startup.  I keep mine in my home directory `~`.
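As a minimal sketch of the environment-variable approach, the snippet below sets the variable for the current session only and reads it back.  The variable name `STATTLE_TOKEN` comes from the cell above; the token value is a placeholder.

```r
## a line in ~/.Renviron would look like this (placeholder value):
##   STATTLE_TOKEN=YourTokenGoesHere

## for illustration only: set the variable for this session
Sys.setenv(STATTLE_TOKEN = "YourTokenGoesHere")

## read it back; Sys.getenv() returns "" when the variable is not set
token <- Sys.getenv("STATTLE_TOKEN")
if (!nzchar(token)) stop("STATTLE_TOKEN is not set")
```

Checking for an empty string up front gives a clearer error than a failed API call later on.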

## Helpers

Out-of-the-box, there are a few helper functions that make it easy to get started.

### Teams

In [None]:
## let's pull down all of the teams
teams <- hockey_teams()

In [None]:
## what do we have
class(teams)

In [None]:
## what are the columns of data
colnames(teams)

In [None]:
## let's look at the first few rows of data
head(teams, 3)

A few notes about some really helpful conventions:

## _Quick Tip_

Teams, players, and games have slugs.  They take the form of:

- `nhl-bos`  
- `nhl-ryan-spooner`
- `nhl-2015-2016-ott-bos-2016-04-9-1230`

Slugs will help you as you dive into your own ideas and explore the API.
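Judging only from the examples above, a slug looks like a league prefix followed by a dash-separated identifier.  Under that assumption (the API may have rules not shown here), a rough sketch of building and splitting slugs in base R:

```r
## build a team slug from its parts (pattern inferred from the examples above)
team_slug <- paste("nhl", "bos", sep = "-")

## split slugs into the league prefix and the remainder
slugs <- c("nhl-bos", "nhl-ryan-spooner")
league <- sub("-.*$", "", slugs)      # everything before the first dash
remainder <- sub("^[^-]+-", "", slugs) # everything after the first dash
```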

### Games


In [None]:
## The functions are documented
?hockey_games

In [None]:
## Get all of the B's regular season games for the season
bos_games <- hockey_games()

In [None]:
## what do we have
dim(bos_games)

In [None]:
colnames(bos_games)
head(bos_games, 3)

### Players

In [None]:
bos_players <- hockey_players()

In [None]:
colnames(bos_players)
head(bos_players, 3)

### Even Injuries

In [None]:
## get the injuries
bos_injuries <- hockey_injuries(team_id = "nhl-bos")

In [None]:
dim(bos_injuries)

In [None]:
colnames(bos_injuries)

In [None]:
## grab a few key columns and print
select(bos_injuries, started_on, location_name, note, status) %>% head(3)

We will be doing much more with helpers in the coming weeks, but the real fun is when you play around with the results at a more granular level. In the meantime, I am going to source a few helpers that I will use next.  These are not part of the R package, but may be in the future.

In [None]:
## source some helper functions
## the key function is parse_stattle
devtools::source_url("http://bit.ly/1OY151g")

<hr/>

# A Deeper Dive into the API

The real workhorse of the R package right now is `ss_get_result`.  

This function allows us to interface with the endpoints at a granular level.

In [None]:
?ss_get_result

In [None]:
## going back to Boston's games, we can filter the results
## below I am going to pull all of the finished Bruins regular season games
qbody <- list(team_id = "nhl-bos", interval_type="regularseason", status="ended")
bos_games <- ss_get_result(ep="games", query=qbody, walk=TRUE)

In [None]:
## the data are returned in lists to allow you more flexibility
length(bos_games)

In [None]:
names(bos_games)

When we set `walk=TRUE`, we are paging through the results of the API.  Each page of data is returned as an entry in the R list.

In [None]:
class(bos_games)

In [None]:
names(bos_games[[1]])

The above represent the data that come back to us. While we could use `do.call` to parse things out, one of the helper functions loaded above makes things even easier.  It is __parse_stattle()__.  We can use the function to get the entry across all of the pages.
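To make the `do.call` route concrete, here is a rough sketch on mock data.  The `pages` list and the `combine_pages()` function are hypothetical stand-ins; this is not how `parse_stattle()` is actually implemented, only an illustration of stacking one entry across pages.

```r
## mock result: a list of pages, each holding a "games" data frame
pages <- list(
  list(games = data.frame(id = c("g1", "g2"), attendance = c(17565, 18000),
                          stringsAsFactors = FALSE)),
  list(games = data.frame(id = "g3", attendance = 17000,
                          stringsAsFactors = FALSE))
)

## pull the named entry out of every page and stack the rows
combine_pages <- function(pages, entry) {
  do.call(rbind, lapply(pages, function(p) p[[entry]]))
}

games_demo <- combine_pages(pages, "games")
```

This only works when every page has the same columns for the entry, which is one reason a dedicated helper is nicer.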

In [None]:
## it takes the raw results, and the API entry
games <- parse_stattle(bos_games, "games")

In [None]:
## what comes back?
class(games)

In [None]:
dim(games)

In [None]:
colnames(games)

In [None]:
head(games, 3)

# Game Logs


![friends](https://33.media.tumblr.com/047a3fb868caa569ae2432c083f2fa7c/tumblr_inline_n6ngg3PFX51szj4b9.gif)

In [None]:
## teams have game logs.  Let's get the Bruins
qbody <- list(team_id = "nhl-bos", interval_type="regularseason", status="ended")
gl_raw <- ss_get_result(ep="team_game_logs", query=qbody, walk=TRUE)

In [None]:
logs <- parse_stattle(gl_raw, "team_game_logs")

In [None]:
colnames(logs)

In [None]:
head(logs, 3)

In [None]:
## plot goals scored by wins and losses -- excludes overtime
filter(logs, team_outcome %in% c("win","loss")) %>% 
 ggplot(aes(team_score)) + geom_density() + facet_grid(team_outcome ~ .)

In [None]:
## players also have game logs
qbody <- list(player_id = "nhl-patrick-kane", interval_type="regularseason", status="ended")
gl_raw <- ss_get_result(ep="game_logs", query=qbody, walk=TRUE)

In [None]:
kane_logs <- parse_stattle(gl_raw, "game_logs")
games <- parse_stattle(gl_raw, "games")

In [None]:
colnames(kane_logs)
colnames(games)

In [None]:
## some quick cleanup to merge the game data onto Kane's logs
names(games)[1] <- "game_id"
games <- select(games, game_id, attendance, started_at)
kane_logs <- left_join(kane_logs, games)

In [None]:
## prep the data for a point streak
kane_logs <- arrange(kane_logs, started_at)
kane_logs <- replace_na(kane_logs, replace = list(points=0))
kane_logs <- mutate(kane_logs, gameid = 1:n())

In [None]:
ggplot(kane_logs, aes(x=gameid, y=points)) + 
 geom_line() + 
 geom_point(aes(colour = factor(team_outcome)), size=3) + 
 theme_bw()
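Once the logs are sorted by date and missing points are set to zero, the longest point streak can be read off with run-length encoding.  The sketch below uses a made-up `points` vector rather than the real logs.

```r
## made-up points per game, already in date order
points <- c(1, 2, 0, 1, 1, 3, 0, 0, 1)

## run-length encode "had at least one point" and keep the TRUE runs
runs <- rle(points > 0)
longest_streak <- max(runs$lengths[runs$values])
```

With the real data, `points` would be the `points` column of `kane_logs` after the prep step above.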

# Stats Endpoint

We can get data for an individual stat.  Let's look at assists, also known as helpers.


In [None]:
qbody <- list(player_id = "nhl-erik-karlsson", 
              interval_type="regularseason", stat="assists", type="hockey_offensive_stat")
stat_raw <- ss_get_result(ep="stats", query=qbody, walk=TRUE)



In [None]:
helpers <- parse_stattle(stat_raw, "stats")
games <- parse_stattle(stat_raw, "games")
games <- select(games, id, started_at)
names(games)[1] <- "game_id"
helpers <- left_join(helpers, games)
helpers <- replace_na(helpers, replace = list(stat=0))
helpers <- arrange(helpers, started_at)
helpers <- transform(helpers, gameid = 1:nrow(helpers))

In [None]:
ggplot(helpers, aes(x=gameid, y=stat)) + 
 geom_line() + 
 geom_point(size=3) + 
 theme_bw()

<hr/>

# Total Stats

In addition to getting a stat across the games (for a player __or__ team), it's possible to get a total stat.  

Below, I want to dive into special teams play.

In [None]:
## a helper function for teams
devtools::source_url("http://bit.ly/1QXoKEX")

In [None]:
## get the teams
teams <- hockey_teams()
## remove the allstar teams
teams <- filter(teams, !slug %in% c("atl", "metro", "pac", "cent"))

In [None]:
## keep the slugs as dataframe
teams <- select(teams, id, nickname, slug)

In [None]:
## get the powerplays, penalties, and some stats on special teams
get_tot_team_stats("power_plays", "hockey_team_stat", teams$slug, teams)
get_tot_team_stats("penalties", "hockey_team_stat", teams$slug, teams)
get_tot_team_stats("goals_power_play", "hockey_team_stat", teams$slug, teams)
get_tot_team_stats("player_points_power_play", "hockey_team_stat", teams$slug, teams)
get_tot_team_stats("player_points_short_handed", "hockey_team_stat", teams$slug, teams)


In [None]:
dim(teams)

In [None]:
colnames(teams)

In [None]:
summary(teams)

In [None]:
ggplot(teams, aes(x=power_plays, y=penalties, label=slug)) +
 geom_text() + 
 geom_hline(aes(yintercept=193.5), linetype="dotted", colour="red") +
 geom_vline(aes(xintercept=154), linetype="dotted", colour="red") + 
 theme_bw()
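The reference lines in the plot above are hard-coded.  They look like medians read off the `summary()` output, but that is an assumption.  A sketch of computing such values from the data instead, using a made-up `teams_demo` data frame:

```r
## made-up totals for five teams
teams_demo <- data.frame(
  slug = c("t1", "t2", "t3", "t4", "t5"),
  power_plays = c(150, 154, 160, 149, 158),
  penalties = c(190, 200, 193, 194, 180),
  stringsAsFactors = FALSE
)

## compute the reference values rather than typing them in
med_pp <- median(teams_demo$power_plays)
med_pen <- median(teams_demo$penalties)

## these could then be passed to the plot, e.g.
##   geom_hline(yintercept = med_pen) and geom_vline(xintercept = med_pp)
```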

In [None]:
## best powerplays
teams <- transform(teams, pp_pct = goals_power_play / power_plays)
arrange(teams, desc(pp_pct)) %>% select(slug, pp_pct) %>% head(5)

In [None]:
## a super naive fit
lm(player_points_power_play ~ player_points_short_handed, teams)

In [None]:
## put the fit line on 
ggplot(teams, aes(x=player_points_short_handed, y=player_points_power_play, label=slug)) +
 geom_text() + 
 geom_abline(intercept=78.7699, slope=0.5859) + 
 theme_bw()
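Rather than copying the intercept and slope by hand, the coefficients can be pulled from the fitted model with `coef()`.  A small sketch on made-up data where the true line is known (the `demo` data frame and its column names are illustrative only):

```r
## made-up data lying exactly on the line pp = 1 + 2 * sh
demo <- data.frame(sh = c(1, 2, 3, 4, 5), pp = c(3, 5, 7, 9, 11))

fit <- lm(pp ~ sh, data = demo)
b <- coef(fit)
intercept <- unname(b["(Intercept)"])
slope <- unname(b["sh"])

## these could then be passed to the plot, e.g.
##   geom_abline(intercept = intercept, slope = slope)
```

This keeps the plotted line in sync with the data if the totals change later in the season.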

In [None]:
## smooth the pattern
ggplot(teams, aes(x=player_points_short_handed, y=player_points_power_play, label=slug)) +
 geom_text() + 
 geom_smooth() +
 theme_bw()

I want to give credit to @IneffectiveMath and all the work that he does around visualization and analysis.  While not exactly the same, I recently saw this tweet and wanted to highlight that similar analyses are possible with the API.

![pp](https://pbs.twimg.com/media/CX62Cb2W8AAQrNC.png)

<hr/>

# B's and Maple Leafs Scoring Patterns

We recently announced two new additions to the API: scoring plays and penalties.  This section will explore the scoring patterns of tonight's matchup between the Bruins and the Maple Leafs.  

The previous post can be found [here](http://blog.stattleship.com/hockey-meetup/)

In [None]:
## get the scoring plays for tonight's teams
bos_sp <- ss_get_result(ep="scoring_plays",
                        query=list(team_id="nhl-bos", status="ended"), 
                        walk=TRUE,
                        verbose=FALSE)
tor_sp <- ss_get_result(ep="scoring_plays",
                        query=list(team_id="nhl-tor", status="ended"), 
                        walk=TRUE,
                        verbose=FALSE)


In [None]:
## parse plays
bos_plays <- parse_stattle(bos_sp, "scoring_plays")
tor_plays <- parse_stattle(tor_sp, "scoring_plays")

## keep regulation
bos_plays <- filter(bos_plays, period_number <= 3)
tor_plays <- filter(tor_plays, period_number <= 3)

## some transformations
bos_plays <- transform(bos_plays,  
                       minute = ceiling(period_seconds/60),
                       team = "Bruins")
tor_plays <- transform(tor_plays,  
                       minute = ceiling(period_seconds/60),
                       team = "Maple Leafs")

## cleanup
bos_plays <- select(bos_plays, -scoring_player_ids)
tor_plays <- select(tor_plays, -scoring_player_ids)

## join
scoring <- bind_rows(bos_plays, tor_plays)


In [None]:
## quick check
with(scoring, table(team))

In [None]:
## summarize the % of goals by period/minute for each team
scoring_summ <- tbl_df(scoring) %>%
 group_by(team, period_number, minute) %>%
 summarise(goals = n())

## make the pct
scoring_summ <- transform(scoring_summ, 
                          pct_goals = ifelse(team=="Bruins", goals/143, goals/110))
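The denominators above (143 and 110) are typed in by hand, presumably each team's goal total at the time.  A sketch of computing each team's share from the data itself, using base R's `ave()` on a made-up `scoring_demo` data frame:

```r
## made-up goal counts by team and period
scoring_demo <- data.frame(
  team = c("Bruins", "Bruins", "Bruins", "Maple Leafs", "Maple Leafs"),
  period_number = c(1, 2, 3, 1, 3),
  goals = c(2, 3, 5, 1, 3),
  stringsAsFactors = FALSE
)

## total goals for the team on each row, then the row's share of that total
team_totals <- ave(scoring_demo$goals, scoring_demo$team, FUN = sum)
scoring_demo$pct_goals <- scoring_demo$goals / team_totals
```

The shares sum to 1 within each team, so the heatmap scale stays comparable across teams as new games are added.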


In [None]:
## heatmap of scoring by period
ggplot(scoring_summ, aes(x=period_number, y=minute)) +
 geom_tile(aes(fill=pct_goals), colour = "white") + 
 facet_grid(~team) +
 scale_fill_gradient(low = "white", high = "red") 

<hr/>

# Clustering Allstar Skaters

The previous post on this can be found [here](http://blog.stattleship.com/one-of-these-is-not-like-the-other/).

Below, we will pull down data on the All-Star roster and, using first-half performance, cluster the skaters elected to All-Star Weekend.

In [None]:
## load the other packages
library(googlesheets)

## and bring in the cached data for the post
load("allstar.rdata")

In [None]:
## source some helper functions
devtools::source_url("http://bit.ly/1OY151g")

In [None]:
## get the allstar roster from a google doc
key <- "12dUvvbAc5h7uH9GaJDiJolXaicY90OAe3cqBT8vjASo"
star <- gs_key(key)
roster <- gs_read(star, 
                  ws = "roster", 
                  range="A1:D45")
rm(key, star)

In [None]:
## quick look at the roster
head(roster)

In [None]:
## create the master allstar dataframe
allstar <- data.frame()


## get the games played
for (player in roster$slug) {
  x <- count_games(player)
  allstar <- bind_rows(allstar, x)
  cat("added ", player, "\n")
} 
rm(x, player)
allstar <- unique(allstar)

In [None]:
## quick sanity check
head(allstar)

In [None]:
## get the stats using the helper functions and the total_stats endpoint
get_tot_stat("goals", "hockey_offensive_stat", allstar$slug, allstar, parse_player=TRUE)
get_tot_stat("assists", "hockey_offensive_stat", allstar$slug, allstar)
get_tot_stat("shots", "hockey_offensive_stat", allstar$slug, allstar)
get_tot_stat("penalty_minutes", "hockey_player_stat", allstar$slug, allstar)
get_tot_stat("plus_minus", "hockey_player_stat", allstar$slug, allstar)
get_tot_stat("shifts", "hockey_player_stat", allstar$slug, allstar)
get_tot_stat("time_on_ice_even_strength_secs", "hockey_player_stat", allstar$slug, allstar)
get_tot_stat("time_on_ice_power_play_secs", "hockey_player_stat", allstar$slug, allstar)
get_tot_stat("time_on_ice_short_handed_secs", "hockey_player_stat", allstar$slug, allstar)
get_tot_stat("faceoff_win_percentage", "hockey_face_off_stat", allstar$slug, allstar)
get_tot_stat("blocked_shots", "hockey_defensive_stat", allstar$slug, allstar)
get_tot_stat("hits", "hockey_defensive_stat", allstar$slug, allstar)

In [None]:
## the helper added data
colnames(allstar)

In [None]:
## quick look
filter(allstar, pos != 'G') %>% head(3)

In [None]:
## keep just the skaters
skaters <- filter(allstar, pos != 'G')

In [None]:
## filter just skaters and keep the columns we want for the clustering
star_stats <- filter(allstar, pos != 'G') %>% 
  select(slug, pos, salary, games_played, goals:hits)
star_stats <- as.data.frame(star_stats)
player_slugs <- star_stats$slug
player_pos <- star_stats$pos
star_stats$slug <- NULL

In [None]:
## some metrics and cleanup for clustering
star_stats <- mutate(star_stats, 
                     points = goals + assists,
                     value = salary / points)
star_stats <- select(star_stats, -salary, -pos)
row.names(star_stats) <- player_slugs


In [None]:
str(star_stats)

In [None]:
## scale the variables
scale_stats <- data.frame(lapply(star_stats, function(x) scale(x)))

In [None]:
## cluster
star_clust <- hclust(dist(scale_stats))

In [None]:
## plot
plot(star_clust, labels=row.names(star_stats), main="2016 All Star Clustering", xlab="")
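To see what the scale, distance, and cluster steps do, here is a toy version with four made-up players.  The `stats_demo` values are invented, and `cutree()` is added only to turn the tree into group labels.

```r
## four made-up skaters: two scorers and two physical players
stats_demo <- data.frame(
  goals = c(30, 28, 5, 4),
  hits = c(20, 25, 150, 160),
  row.names = c("a", "b", "c", "d")
)

## put the columns on a common scale, then cluster on Euclidean distance
scaled <- scale(stats_demo)
clust <- hclust(dist(scaled))

## cut the tree into two groups
groups <- cutree(clust, k = 2)
```

Scaling matters here: without it, the `hits` column would dominate the distances simply because its numbers are larger.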

### But in the end .....

<br>

![scott](https://cdn2.vox-cdn.com/thumbor/Dfa3SxrddruH_BdsUSih8l1ZADc=/cdn0.vox-cdn.com/uploads/chorus_asset/file/5984079/YES.0.gif)

<hr/>

# Scoring Networks

As noted above, we recently published events for NHL games, most notably scoring plays and penalties.  Below I am going to work through how you might think of these data as a network.

The basic idea is to think of how a goal is scored, as referenced through the goal scorer, and optionally, the primary and secondary assists.

For example:

![network](https://dl.dropboxusercontent.com/u/15276022/scoring-network.png)
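The idea can be sketched as a small edge list in base R.  The data below are made up, with column names borrowed from the scoring plays used later; each edge links a player to the next player listed on the same play.

```r
## made-up scoring plays: rank 1 is the scorer, 2 and 3 are the assists
plays <- data.frame(
  scoring_play_id = c("p1", "p1", "p1", "p2", "p2"),
  rank = c(1, 2, 3, 1, 2),
  slug = c("scorer-a", "helper-b", "helper-c", "scorer-b", "helper-a"),
  stringsAsFactors = FALSE
)

## sort within each play, then look up the previous player on the same play
plays <- plays[order(plays$scoring_play_id, plays$rank), ]
plays$prev_player <- ave(plays$slug, plays$scoring_play_id,
                         FUN = function(x) c(NA, head(x, -1)))

## the first player on each play has no predecessor, so drop those rows
edges_demo <- plays[!is.na(plays$prev_player), c("prev_player", "slug")]
```

Which direction the edges should point (scorer to assist or the reverse) is a modeling choice; this sketch follows the rank order.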

To give credit where credit is due, there is an emerging Passing Project that is doing some really cool stuff in this area.  

Check out the work here: https://hockey-graphs.com/tag/passing-project/ and a recent tweet of the sort of work they are doing to transcribe games into datasets.

![](https://pbs.twimg.com/media/CWcTzewU4AEQsvR.jpg)

With respect to the API, I am going to use the elements returned in the Scoring Plays and think of the data in graph terms.

In [None]:
## build the dataset
scoring <- data.frame()
for (team in teams$slug) {
  sp_raw <- ss_get_result(ep="scoring_plays", 
                          query=list(team_id=team),
                          walk=TRUE)
  ## parse
  score_plays <- parse_stattle(sp_raw, "scoring_plays")
  score_players <- parse_stattle(sp_raw, "scoring_players")
  players <- parse_stattle(sp_raw, "players")
  ##cleanup
  players <- select(players, 
                    id, slug, name, position_abbreviation, 
                    salary, years_of_experience)
  players <- unique(players)
  names(players)[1] <- "player_id"
  score_plays <- select(score_plays, id, empty_net, period_number, 
                        period_seconds, scoring_type)
  names(score_plays)[1] <- "scoring_play_id"
  score_plays <- unique(score_plays)
  score_players <- select(score_players, player_id, role, scoring_play_id)
  ## create a rank variable, where scorers are 1, primary = 2, secondary help = 3
  score_players <- tbl_df(score_players) %>% 
    group_by(scoring_play_id) %>% 
    mutate(rank = 1:length(scoring_play_id)) %>% 
    ungroup
  ## add on player info to the scoring players
  dat <- left_join(score_players, players)
  ## add on the play info
  dat <- left_join(dat, score_plays)
  ## add the team
  dat$team <- team
  ## bind to the scoring dat
  scoring <- bind_rows(scoring, dat)
  ## status
  cat("finished ", team, "\n")
}

In [None]:
## quick look at what we can do
filter(scoring, scoring_play_id == '652ea514-c934-41c6-be57-19ee6bd5ed32') %>% 
 select(name, rank, role)

In [None]:
## create the datasets for the network
## create a players dataframe
players <- select(scoring, slug:years_of_experience) %>% 
  unique

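## create the data for the edgelist
## sort within each play, then lag within the play so edges never cross plays
## (base transform() ignores dplyr groups, so use mutate() here)
edges <- tbl_df(scoring) %>% 
  arrange(scoring_play_id, rank) %>% 
  group_by(scoring_play_id) %>% 
  mutate(prev_player = lag(slug, 1)) %>% 
  ungroup 
edges <- filter(edges, !is.na(prev_player)) %>% 
  select(prev_player, slug)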

In [None]:
## load it into igraph
## if you dont have it, simply install.packages("igraph")
g <- igraph::graph.data.frame(edges, directed=TRUE, players)

In [None]:
## a quick overview of what we loaded
g

In [None]:
## the always plotted hairball-- do you see patterns in the data?
plot(g)

In [None]:
## some summary stats on the graph
pr <- igraph::page_rank(g) ##pagerank
bt <- igraph::betweenness(g) #betweenness

In [None]:
## what came back
class(pr)
class(bt)

In [None]:
## put together the data for the players
graph_sum <- data.frame(player=names(bt), betweeness=bt, pagerank=pr$vector)
rownames(graph_sum) = NULL

In [None]:
## quick look
head(graph_sum)

In [None]:
ggplot(graph_sum, aes(x=pagerank, y=betweeness)) + 
 geom_point() + 
 geom_smooth()

In [None]:
## any guesses on the outlier?
filter(graph_sum, pagerank> .005 & betweeness > 10000)

### Next steps

There is a lot more that you can do with this level of data, but one next step would be to use a tool like Neo4j to dive deeper into the patterns.  It's totally possible with igraph, but the Cypher query language is really expressive and makes it easy to evaluate complex patterns. 

I previously entered the recent Neo4j Graphgist Competition where I used our API and Neo4j to demo how one might do a simple team ranking within the database.