# Introduction

The excitement of millions of people rises each October as the NHL season begins. The league has 31 teams, each playing 82 games, and the top teams qualify for the Stanley Cup playoffs. The NHL is split into 2 conferences, East and West, and each conference is further split into 2 divisions. 8 teams from each conference qualify for the playoffs: the top 3 teams in each division, plus 2 “wild card” teams from each conference. The wild card teams are the two best remaining teams in their conference, regardless of division. Can we use previous years' data to predict which teams should be in the playoffs this year (2019)? This is the question I want to try to answer. As of writing, the regular season has ended, so we can compare our machine-learned predictions to reality.
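
The qualification rule can be sketched in a few lines of Python. This is a simplified illustration, not the NHL's full procedure: it ranks teams by points only and ignores the league's tie-breakers. The function name `playoff_teams` and the toy standings are made up for the example.

```python
from collections import defaultdict

def playoff_teams(standings):
    """Return the set of playoff teams for ONE conference.

    standings: list of (team, division, points) tuples.
    Simplification: ranks by points only, ignoring real NHL tie-breakers.
    """
    by_division = defaultdict(list)
    for team, division, points in standings:
        by_division[division].append((points, team))

    qualified = set()
    for teams in by_division.values():
        # The top 3 teams in each division qualify automatically
        top3 = sorted(teams, reverse=True)[:3]
        qualified.update(team for _, team in top3)

    # The 2 wild cards are the best remaining teams in the conference
    remaining = sorted(
        ((points, team) for team, _, points in standings if team not in qualified),
        reverse=True,
    )
    qualified.update(team for _, team in remaining[:2])
    return qualified

# Toy conference with two divisions of five teams each (made-up numbers)
toy = [
    ("A1", "A", 110), ("A2", "A", 104), ("A3", "A", 100), ("A4", "A", 98), ("A5", "A", 80),
    ("B1", "B", 107), ("B2", "B", 95), ("B3", "B", 90), ("B4", "B", 89), ("B5", "B", 70),
]
print(sorted(playoff_teams(toy)))
```

In the toy standings, team A4 takes a wild card spot with more points than two of the automatic qualifiers from division B, which is why a wild card team is not necessarily 7th or 8th in conference points.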

# Making the Data-set

The first step is making the data-set. We will use historical data starting from the salary cap era (2005). Although the NHL has existed since 1917, there was no restriction on player spending before 2005. With the enforcement of a salary cap, the playing field became more level. The data was collected from [Hockey Reference](http://hockey-reference.com/). I manually stitched together the stats from all previous seasons into the CSV file ["The PLAYOFFS - Data.csv"](https://github.com/Newtonsboi/NHL-Playoff-Teams/blob/master/The%20PLAYOFFS%20-%20Data.csv). Then, for each team, I assigned a 0 if the team made the playoffs that season, and a 1 otherwise. The data consists of 33 features.
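
As a side note, labels like these could also be derived in code rather than by hand. The sketch below is not how the file was built. It assumes that a trailing `*` on a team name marks a playoff team, and it follows the 0 = playoffs, 1 = no playoffs convention described here. The column name `Missed` and the row `Example Team` are made up for illustration.

```python
import pandas as pd

# Toy rows; a trailing * is assumed to mark a playoff team
teams = pd.DataFrame({"Team": ["Carolina Hurricanes*", "Example Team", "Anaheim Ducks*"]})

# 0 = made the playoffs, 1 = missed them (the convention used in this post)
teams["Missed"] = (~teams["Team"].str.endswith("*")).astype(int)

# Strip the marker once the label has been derived
teams["Team"] = teams["Team"].str.rstrip("*")
print(teams)
```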


In [15]:
import pandas as pd
combinedData = pd.read_csv("The PLAYOFFS - Data.csv")  # Import data
print(combinedData) # Show what the CSV file looks like

     Rk                      Team  AvAge  GP   W   L  OL  PTS   PTS%   GF  \
0     3      Carolina Hurricanes*   29.2  82  52  22   8  112  0.683  286   
1     4            Anaheim Ducks*   28.5  82  48  20  14  110  0.671  254   
2     1        Detroit Red Wings*   32.1  82  54  21   7  115  0.701  252   
3     9      Pittsburgh Penguins*   26.6  82  45  28   9   99  0.604  258   
4     3       Chicago Blackhawks*   26.6  82  52  22   8  112  0.683  262   
5     7            Boston Bruins*   28.3  82  46  25  11  103  0.628  244   
6    13        Los Angeles Kings*   26.7  82  40  27  15   95  0.579  188   
7     1       Chicago Blackhawks*   26.8  48  36   7   5   77  0.802  149   
8    10        Los Angeles Kings*   27.4  82  46  28   8  100  0.610  198   
9     7       Chicago Blackhawks*   29.3  82  48  28   6  102  0.622  220   
10    4      Pittsburgh Penguins*   29.0  82  48  26   8  104  0.634  241   
11    2      Pittsburgh Penguins*   28.7  82  50  21  11  111  0.677  278   

## Feature Selection
### Importing the Data

As the data-set is not large and we have a lot of features, I want to avoid over-fitting. To do so, I need to cut down the number of features. I want to be able to predict a playoff team without knowing the team’s ranking, or factors that influence ranking, as that would make the task trivial. So, I got rid of these features:

-   Ranking 

-   Games Played

-   Wins and Losses 

-   Strength of Schedule

-   Simple Rating System 

I also made sure there are no repetitive features. For example, there is a Goals Scored feature, but also Even Strength Goals Scored and Power Play Goals Scored, and the latter two sum to the former. So I removed the former. As a result, the number of features drops from 33 to 19.

In [14]:
removeIndices = ["Rk","GP", "W", "L", "OL", "PTS", "PTS%","SOW","SOL","SRS","GA","GF", "SOS"] # Define features that influence ranking
combinedData = pd.read_csv("The PLAYOFFS - Data.csv")  # Import data
combinedData.iloc[:,1] = combinedData.iloc[:,1].map(lambda x: x.strip('*')) # Some team names carry a *; strip it
combinedData = combinedData.set_index("Team") # Set the team as the index
combinedData = combinedData.drop(removeIndices,axis=1) # Remove the unwanted features
combinedData = combinedData.sample(frac=1, random_state=421) # Shuffle the dataset
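
The redundancy argument above can be checked directly on the data. The following is a minimal sketch on toy numbers. The column names `EVGF` and `PPGF` are hypothetical stand-ins and may not match the headers in the real CSV.

```python
import pandas as pd

# Toy data; the column names EVGF and PPGF are hypothetical stand-ins
toy = pd.DataFrame({
    "GF":    [250, 230, 210],
    "EVGF":  [190, 180, 160],
    "PPGF":  [60, 50, 50],
    "AvAge": [28.1, 27.4, 29.0],
})

# A feature is redundant here if it equals the sum of two others on every row
is_redundant = (toy["GF"] == toy["EVGF"] + toy["PPGF"]).all()
print(is_redundant)

# Highly correlated pairs are another hint of repeated information
print(toy.corr().round(2))
```

If the check comes back false on the real data, the total is carrying extra information and dropping it would deserve a second look.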

Next, I separate the training data from the training targets. The targets are the 0s and 1s assigned to playoff and non-playoff teams, respectively.

In [19]:
unFilteredtrainData = combinedData.iloc[:,:-1] # Separate the training data from the training targets
trainTargets = combinedData.iloc[:,-1]
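
The split relies on the target being the last column of the frame. Here is a small sketch on a toy frame that shows what the two `iloc` slices select and how to check the class balance. The column names and values are made up for illustration.

```python
import pandas as pd

# Toy frame: two feature columns followed by the target in the last column
toy = pd.DataFrame(
    {"AvAge": [29.2, 27.0, 28.5, 26.1], "PP%": [17.9, 15.2, 21.0, 16.4], "Missed": [0, 1, 0, 1]},
    index=["Team A", "Team B", "Team C", "Team D"],
)

features = toy.iloc[:, :-1]   # every column except the last
targets = toy.iloc[:, -1]     # only the last column

print(features.shape)
print(targets.value_counts().sort_index())  # how many teams fall in each class
```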

Then, I import this season's (2018 - 2019) team stats, and remove the same features as in the training data. As this is our test data, we do not have test targets (0s and 1s).

In [23]:
unFilteredtestData = pd.read_csv("The PLAYOFFS - 2018.csv") # Import test set, which is this year's data
unFilteredtestData.iloc[:,1] = unFilteredtestData.iloc[:,1].map(lambda x: x.strip('*'))
unFilteredtestData = unFilteredtestData.set_index("Team")
unFilteredtestData = unFilteredtestData.drop(removeIndices,axis=1)
unFilteredtestData = unFilteredtestData.sample(frac=1,  random_state=421) # Randomly shuffle
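
Since the training and test files go through the same cleaning steps, those steps could be wrapped in one function. The sketch below is one possible way to do that. The helper `load_team_stats` is hypothetical, it selects the `Team` column by name rather than by position, and the tiny in-memory CSVs stand in for the real files.

```python
import io
import pandas as pd

def load_team_stats(source, drop_cols, seed=421):
    """Hypothetical helper that repeats the cleaning steps used above."""
    df = pd.read_csv(source)
    df["Team"] = df["Team"].str.rstrip("*")       # drop the trailing * marker
    df = df.set_index("Team").drop(columns=drop_cols)
    return df.sample(frac=1, random_state=seed)   # shuffle the rows

# Tiny in-memory CSVs stand in for the real files (values are made up)
train_csv = io.StringIO("Rk,Team,AvAge,W,Missed\n1,Team A*,28.0,50,0\n2,Team B,27.5,30,1\n")
test_csv = io.StringIO("Rk,Team,AvAge,W\n1,Team C*,26.9,48\n2,Team D,29.1,35\n")

train = load_team_stats(train_csv, ["Rk", "W"])
test = load_team_stats(test_csv, ["Rk", "W"])

# The test features should line up with the training features, in the same order
assert list(test.columns) == list(train.columns[:-1])
print(sorted(train.index))
```

The final assertion guards against the two files drifting apart, for example if a column is renamed in one of them.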

### Feature Extractor