# NHL WIN PREDICTION MACHINE LEARNING MODEL

### The Objective: 
The best NHL win-prediction machine learning models currently available are correct about 70% of the time. The goal of this model is to get close to that accuracy.

### Choosing the Model:
There are many different machine learning models we could use in this situation. For this model, I will be implementing a Linear Regression Model and a Random Forest Model. 

### Selecting the Stats: 
Many advanced NHL stats are available today. For this model, I will use two groups of stats: stats from each team's last 10 games and stats from its last 40 games. The last 40 games give a broader view of the team as a whole, highlighting factors such as playstyle, coaching philosophy, and general strengths and weaknesses. The last 10 games highlight streakiness, recent coaching or player changes, and who's hot. I believe a combination of the two will give my model the best overall representation of a team.
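As a rough illustration of the two windows described above, the sketch below computes per-team rolling means over the previous N games with pandas. The helper name `add_rolling_means` and the `_lastN` column suffix are made up for this example, and the toy values are not real NHL data. Shifting by one game before rolling is an assumption on my part: it keeps the current game's result out of its own features.

```python
import pandas as pd


def add_rolling_means(games: pd.DataFrame, stat: str, windows=(10, 40)) -> pd.DataFrame:
    """Add per-team rolling means of `stat` over the previous N games.

    Hypothetical helper: expects 'team' and 'gameDate' columns.
    """
    out = games.sort_values(["team", "gameDate"]).copy()
    for n in windows:
        # shift(1) drops the current game, so each row only sees earlier games.
        out[f"{stat}_last{n}"] = out.groupby("team")[stat].transform(
            lambda s: s.shift(1).rolling(n, min_periods=1).mean()
        )
    return out


# Tiny made-up example (illustrative values only).
toy = pd.DataFrame({
    "team": ["NYR"] * 4,
    "gameDate": [20081005, 20081010, 20081013, 20081015],
    "xGoalsFor": [2.0, 4.0, 3.0, 1.0],
})
features = add_rolling_means(toy, "xGoalsFor", windows=(2,))
print(features[["gameDate", "xGoalsFor", "xGoalsFor_last2"]])
```

A team's first game has no history, so its rolling value is `NaN`; those rows would need to be dropped or filled before training.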

#### Team Stats:
- fenwickPercentage
- xGoalsFor
- flurryAdjustedxGoalsFor
- shotsOnGoalFor
- shotAttemptsFor
- goalsFor
- penalityMinutesFor
- 1 - fenwickPercentage (Fenwick% Against)


#### Goalie Stats (Starter):
- savedShotsOnGoalFor / shotsOnGoalFor (SV%)

#### Skater Stats:
Skater stats are more complicated than team and goalie stats because each team has many more skaters and not every skater has the same impact. To account for this, I will weight each starting skater's stats by ice time (minutes per game) and take the weighted team average across all starting skaters.
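The ice-time weighting described above could look something like the following sketch. The function name `ice_time_weighted_mean`, the `minutes_per_game` column, and the toy skaters are all hypothetical; the real skater data may use different column names.

```python
import numpy as np
import pandas as pd


def ice_time_weighted_mean(skaters: pd.DataFrame, stat: str,
                           weight_col: str = "minutes_per_game") -> float:
    """Average `stat` across skaters, weighting each skater by ice time.

    Hypothetical helper: `weight_col` is an assumed column name.
    """
    return float(np.average(skaters[stat], weights=skaters[weight_col]))


# Made-up starting skaters (placeholder names and values).
toy_skaters = pd.DataFrame({
    "name": ["A", "B", "C"],
    "minutes_per_game": [20.0, 15.0, 5.0],
    "xGoals": [0.5, 0.3, 0.1],
})
team_xgoals = ice_time_weighted_mean(toy_skaters, "xGoals")
print(team_xgoals)
```

With these weights, a skater who plays 20 minutes a night counts four times as much as one who plays 5, which is the intent of the weighting.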

In [26]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd



#### Step 1: Clean/Organize the data for the model

In [27]:
df = pd.read_csv('filtered_team_stats2.csv', nrows=3000)
display(df)

Unnamed: 0,team,season,name,gameId,playerTeam,opposingTeam,home_or_away,gameDate,position,situation,...,unblockedShotAttemptsAgainst,scoreAdjustedUnblockedShotAttemptsAgainst,dZoneGiveawaysAgainst,xGoalsFromxReboundsOfShotsAgainst,xGoalsFromActualReboundsOfShotsAgainst,reboundxGoalsAgainst,totalShotCreditAgainst,scoreAdjustedTotalShotCreditAgainst,scoreFlurryAdjustedTotalShotCreditAgainst,playoffGame
0,NYR,2008,NYR,2008020003,NYR,T.B,HOME,20081005,Team Level,all,...,32,31.984,5,0.241,0.000,0.000,1.091,1.117,1.091,0
1,NYR,2008,NYR,2008020010,NYR,CHI,HOME,20081010,Team Level,all,...,45,43.911,3,0.448,0.407,0.407,2.738,2.751,2.730,0
2,NYR,2008,NYR,2008020034,NYR,N.J,HOME,20081013,Team Level,all,...,39,38.019,3,0.383,1.139,1.139,2.698,2.691,2.242,0
3,NYR,2008,NYR,2008020044,NYR,BUF,HOME,20081015,Team Level,all,...,24,24.517,0,0.270,1.380,1.380,1.777,1.805,1.796,0
4,NYR,2008,NYR,2008020057,NYR,TOR,HOME,20081017,Team Level,all,...,33,33.549,2,0.406,0.130,0.130,1.662,1.701,1.697,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,PHI,2016,PHI,2016021227,PHI,CAR,HOME,20170409,Team Level,all,...,46,48.195,0,0.541,0.456,0.456,3.106,3.257,3.184,0
2996,PHI,2017,PHI,2017020066,PHI,WSH,HOME,20171014,Team Level,all,...,36,34.918,6,0.309,0.179,0.179,1.661,1.644,1.635,0
2997,PHI,2017,PHI,2017020083,PHI,FLA,HOME,20171017,Team Level,all,...,54,53.624,4,0.572,0.098,0.098,3.453,3.478,3.407,0
2998,PHI,2017,PHI,2017020097,PHI,NSH,HOME,20171019,Team Level,all,...,37,38.727,7,0.293,0.032,0.032,1.339,1.416,1.404,0


In [33]:
stats_we_want = ['gameId', 
                 'season', 
                 'gameDate', 
                 'team', 
                 'home_or_away',
                 'opposingTeam',
                 'fenwickPercentage', 
                 'xGoalsFor',
                 'flurryAdjustedxGoalsFor',
                 'shotsOnGoalFor',
                 'shotAttemptsFor',
                 'goalsFor',
                 'penalityMinutesFor',
                 
                 'xGoalsAgainst',
                 'flurryAdjustedxGoalsAgainst',
                 'shotsOnGoalAgainst',
                 'shotAttemptsAgainst',
                 'goalsAgainst',
                 'penalityMinutesAgainst'
                ]

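# Keep only the selected columns, with one row per game
filtered_df = df[stats_we_want].drop_duplicates(subset=['gameId'])
# The "For" stats belong to the home team and the "Against" stats to the away team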
renamed_df = filtered_df.rename(columns={'team': 'HomeTeam',
                                            'opposingTeam': 'AwayTeam',
                                            'fenwickPercentage': 'Home_fenwickPercentage',
                                            'xGoalsFor': 'Home_xGoals',
                                            'flurryAdjustedxGoalsFor': 'Home_flurryAdjustedxGoals',
                                            'shotsOnGoalFor': 'Home_shotsOnGoal',
                                            'shotAttemptsFor': 'Home_shotAttempts',
                                            'goalsFor': 'Home_goals',
                                            'penalityMinutesFor': 'Home_penaltyMinutes',
                                            'xGoalsAgainst': 'Away_xGoals',
                                            'flurryAdjustedxGoalsAgainst': 'Away_flurryAdjustedxGoals',
                                            'shotsOnGoalAgainst': 'Away_shotsOnGoal',
                                            'shotAttemptsAgainst': 'Away_shotAttempts',
                                            'goalsAgainst': 'Away_goals',
                                            'penalityMinutesAgainst': 'Away_penaltyMinutes'
                                        })
# The away team's Fenwick% is the complement of the home team's
renamed_df['Away_FenwickPercentage'] = 1 - renamed_df['Home_fenwickPercentage']
display(renamed_df)

Unnamed: 0,gameId,season,gameDate,HomeTeam,home_or_away,AwayTeam,Home_fenwickPercentage,Home_xGoals,Home_flurryAdjustedxGoals,Home_shotsOnGoal,Home_shotAttempts,Home_goals,Home_penaltyMinutes,Away_xGoals,Away_flurryAdjustedxGoals,Away_shotsOnGoal,Away_shotAttempts,Away_goals,Away_penaltyMinutes,Away_FenwickPercentage
0,2008020003,2008,20081005,NYR,HOME,T.B,0.6190,1.793,1.744,39,72,2,15,0.916,0.894,19,44,1,13,0.3810
1,2008020010,2008,20081010,NYR,HOME,CHI,0.4643,1.938,1.927,29,51,4,16,2.762,2.678,32,53,2,16,0.5357
2,2008020034,2008,20081013,NYR,HOME,N.J,0.4507,1.562,1.549,24,45,4,10,3.454,2.857,27,58,1,12,0.5493
3,2008020044,2008,20081015,NYR,HOME,BUF,0.5714,1.729,1.672,20,50,1,26,2.887,2.840,19,33,3,15,0.4286
4,2008020057,2008,20081017,NYR,HOME,TOR,0.6333,4.186,3.995,32,76,0,11,1.386,1.374,21,43,0,21,0.3667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,2016021227,2016,20170409,PHI,HOME,CAR,0.5490,3.911,3.848,44,77,3,6,3.021,2.951,35,61,3,2,0.4510
2996,2017020066,2017,20171014,PHI,HOME,WSH,0.5663,3.246,3.179,37,57,8,2,1.531,1.507,23,52,2,4,0.4337
2997,2017020083,2017,20171017,PHI,HOME,FLA,0.4953,4.797,4.511,39,63,5,17,2.979,2.915,41,76,1,13,0.5047
2998,2017020097,2017,20171019,PHI,HOME,NSH,0.5256,2.384,2.318,28,58,0,4,1.078,1.069,24,54,1,10,0.4744


#### Step 2: Create the initial model
We are going to start by implementing a Random Forest Classifier model using the scikit-learn library.

In [29]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=60, min_samples_split=12, random_state=1)

#### Step 3: Split up the data into training data and testing data
Since the goal of this model is to use past data to predict future data, the split is chronological rather than random: all games that happened before 2018 are the training data, and all games that happened after 2018 are the testing data.
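A minimal sketch of such a chronological split, followed by fitting the Random Forest from Step 2, might look like this. Several things here are assumptions rather than part of the notebook: the helper `split_by_season`, the choice to place the 2018 season itself in the test set, the `home_win` target column, and the randomly generated toy data, which exists only so the example runs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def split_by_season(games: pd.DataFrame, cutoff_season: int = 2018):
    """Split chronologically on the 'season' column.

    Assumption: seasons before the cutoff train, the cutoff season onward tests.
    """
    train = games[games["season"] < cutoff_season]
    test = games[games["season"] >= cutoff_season]
    return train, test


# Random toy data (not real NHL games): 20 games per season.
rng = np.random.default_rng(0)
toy_games = pd.DataFrame({
    "season": np.repeat([2015, 2016, 2017, 2018, 2019], 20),
    "Home_xGoals": rng.normal(2.8, 0.6, 100),
    "Away_xGoals": rng.normal(2.6, 0.6, 100),
})
# Hypothetical target column, derived from the toy features for illustration only.
toy_games["home_win"] = (toy_games["Home_xGoals"] > toy_games["Away_xGoals"]).astype(int)

predictors = ["Home_xGoals", "Away_xGoals"]
train, test = split_by_season(toy_games)

rf = RandomForestClassifier(n_estimators=60, min_samples_split=12, random_state=1)
rf.fit(train[predictors], train["home_win"])

preds = rf.predict(test[predictors])
accuracy = accuracy_score(test["home_win"], preds)
print(f"Test accuracy on toy data: {accuracy:.3f}")
```

In a real pipeline the predictors would be pre-game stats (for example, the last-10 and last-40 game figures) rather than stats from the game being predicted, since those are not known before puck drop.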