# Pythagorean Expectation in the NHL
### Abstract
> In this project, I use the Pythagorean Expectation to investigate how goals scored by an NHL team and its opponents over a season can predict the team's success rate (as measured by winning percentage).
Using a combination of metrics and visualizations, I assess the predictive power of the Pythagorean Expectation, before trying my hand at optimizing the formula based on mean squared error.

## Introduction
In 1977, baseball analyst Bill James introduced his formula for relating a team’s win percentage to total runs scored and allowed:

$$ win\ percentage \approx {100}\ *\ \frac{(runs\ scored)^2}{(runs\ scored)^2 + (runs\ allowed)^2}\ $$

Simply put, teams that accumulated larger positive run differentials during a span of games were expected to have won a higher percentage of them (and vice versa). The formula was dubbed the Pythagorean Expectation for its resemblance to the well-known geometry theorem.
<br />
<br />
In many cases, the formula’s output has proven to be a better predictor of teams’ future success than current win percentage. Teams that significantly under- or outperformed the expected rate in the early part of a given season, presumably due to the randomness associated with the outcome of close games, tended to regress to the predicted win rate in the end.
<br />

Due to its simplicity and predictive potency, the Pythagorean Expectation has been co-opted and adapted to other score-based team sports. Amendments to the original formula have mostly revolved around tinkering with the exponent value so as to minimize the model's error (Bill James himself suggested using a value of 1.82 for improved accuracy). Perhaps most notably, an NBA executive presented his "Modified Pythagorean Expectation" in 1994, using an exponent of 16.5.

Prior to any optimization efforts, the Pythagorean Expectation can be formulated as follows:

$$ Pythagorean\ expectation = {100}\ *\ \frac{(points\ scored)^\alpha}{(points\ scored)^\alpha + (points\ allowed)^\alpha}\ $$
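
As a minimal sketch, the general formula and an exponent search based on mean squared error could look like the following. The function names, the candidate grid and the goal totals are illustrative assumptions, not the project's actual code or data. The toy win percentages are generated from the formula itself with an exponent of 2, so the search should simply recover that value.

```python
def pythagorean_expectation(goals_for, goals_against, alpha=2.0):
    """Expected win percentage (0-100) from goals scored and allowed."""
    scored = goals_for ** alpha
    allowed = goals_against ** alpha
    return 100 * scored / (scored + allowed)


def mean_squared_error(alpha, seasons):
    """Average squared gap between observed and expected win percentage.

    `seasons` is a list of (goals_for, goals_against, win_pct) tuples,
    with win_pct on the same 0-100 scale as the formula.
    """
    errors = [
        (win_pct - pythagorean_expectation(gf, ga, alpha)) ** 2
        for gf, ga, win_pct in seasons
    ]
    return sum(errors) / len(errors)


def best_alpha(seasons, candidates):
    """Return the candidate exponent with the lowest mean squared error."""
    return min(candidates, key=lambda a: mean_squared_error(a, seasons))


# Synthetic check with made-up (scored, allowed) pairs: the "observed"
# win percentages are produced by the formula with alpha = 2.
goal_totals = [(250, 200), (220, 230), (190, 260)]
toy_seasons = [
    (gf, ga, pythagorean_expectation(gf, ga, 2.0)) for gf, ga in goal_totals
]
candidates = [1.0 + step / 10 for step in range(21)]  # 1.0, 1.1, ..., 3.0
print(round(best_alpha(toy_seasons, candidates), 1))
```

A plain grid search keeps the sketch dependency-free. With real season data, a finer grid or a numerical optimizer would be a natural next step.
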
<br />
My objective is to use the latest NHL data to assess how the Pythagorean Expectation fares with hockey, and to propose a value of alpha that optimizes the model's predictive power.<br />
Using basic visualizations, I investigate the 

## Data
### Sourcing
The Pythagorean Expectation has permeated sports analytics in large part thanks to how few and transferable its input variables are. For our purposes, baseball runs can obviously be equated to hockey goals, and until further notice, a win is a win no matter the game played.

I scraped the wins and goals data for the past 10 NHL seasons (starting 2012/13) directly from the league's official website.

<img src="img/raw_data.png" alt="drawing" width="600"/>

### Overtime losses
The NHL has a particularity that adds a small wrinkle to the analysis: standings within a season rely on a points system rather than win percentage. The points system works as follows:
- a win is worth 2 points
- an overtime/shootout loss is worth 1 point
- a regulation loss is worth 0 points

To reflect this ranking system when computing win percentage, I treat overtime losses as 'half-wins', leading to the following metric:

$$ win\ percentage = \frac{wins + 0.5*overtime\ losses}{wins + overtime\ losses + regulation\ losses} $$
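
The metric above can be sketched as a small helper. The function name and the sample record are hypothetical, chosen only for illustration. Note that counting overtime losses as half-wins is the same as dividing standings points by the maximum points available (2 per game).

```python
def adjusted_win_percentage(wins, overtime_losses, regulation_losses):
    """Win percentage as a fraction, counting overtime losses as half-wins."""
    games_played = wins + overtime_losses + regulation_losses
    if games_played == 0:
        raise ValueError("at least one game must have been played")
    return (wins + 0.5 * overtime_losses) / games_played


# Hypothetical 82-game record: 50 wins, 10 overtime losses, 22 regulation losses.
record = (50, 10, 22)
win_pct = adjusted_win_percentage(*record)

# Same quantity expressed as points earned over points available.
points = 2 * record[0] + record[1]
points_pct = points / (2 * sum(record))

print(f"{win_pct:.4f} {points_pct:.4f}")
```

The helper returns a fraction between 0 and 1. Multiplying by 100 puts it on the same scale as the Pythagorean Expectation formula.
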

### Subsampling
I decided to focus on a subset of teams that should provide interesting case studies. My rationale for each pick is as follows:

1. __Colorado Avalanche:__ the defending Stanley Cup champions
2. __Tampa Bay Lightning:__ arguably the most successful team in recent years
3. __Toronto Maple Leafs:__ a team returning to prominence, led by superstar Auston Matthews
4. __Montreal Canadiens:__ a personal choice, and a team on the decline
5. __Ottawa Senators:__ a franchise that has struggled to overcome its small-market status
6. __Buffalo Sabres:__ with all due respect, the NHL's perennial bottom-feeder

## Exploratory Data Analysis
### Winning Percentage
Ultimately, the Pythagorean Expectation was created as a model for winning percentage. Before delving into the analytics, I want to gain an overview of how the target variable is distributed.

The above graphs also provide some substance for my choice of teams to focus on.
The lines clearly show Tampa Bay's run of sustained success (TBL), the rise of Colorado (COL) and Toronto (TOR), Montreal's declining season-to-season record (MTL), and the continued struggles of Ottawa (OTT) and Buffalo (BUF).

### Goals Scored vs. Allowed

### Input Synergies
*The team that scores more points than the other wins the game.*

Sounds simple enough, right? Unfortunately, the relationship between points scored and wins quickly becomes murky when looked at over the span of multiple games.