# NHL WIN PREDICTION MACHINE LEARNING MODEL

### The Objective: 
The best NHL win prediction machine learning models currently reach roughly 70% accuracy. The goal of this model is to get close to that figure.

### Choosing the Model:
There are many different machine learning models we could use in this situation. For this model, I will be implementing a Linear Regression Model and a Random Forest Model. 

### Selecting the Stats: 
There are many advanced NHL stats available today. For this model, I will be using two groups of stats: stats from the last 10 games and stats from the last 40 games. The last 40 games give a broader view of the team as a whole, highlighting factors such as playstyle, coaching philosophy, and general strengths and weaknesses. The last 10 games highlight streakiness, recent coaching/player changes, and who's hot. I believe a combination of the two will give my model the best overall representation of a team.
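As a rough illustration of the two windows, the sketch below computes per-team rolling means over the previous N games with pandas. The helper name `add_rolling_means` and the toy numbers are made up for this example; the column names `team`, `gameDate`, and `xGoalsFor` come from the dataset, and a window of 2 stands in for the 10- and 40-game windows so the result is easy to check by hand.

```python
import pandas as pd

def add_rolling_means(games, stat_cols, windows=(10, 40)):
    """Append per-team rolling means of each stat over the previous N games.

    Hypothetical helper for illustration; not part of the notebook.
    """
    games = games.sort_values(["team", "gameDate"]).copy()
    for window in windows:
        for col in stat_cols:
            # shift(1) drops the current game, so only earlier games feed the average
            games[f"{col}_last{window}"] = games.groupby("team")[col].transform(
                lambda s: s.shift(1).rolling(window, min_periods=1).mean()
            )
    return games

# Made-up data: one team, five games
toy = pd.DataFrame({
    "team": ["NYR"] * 5,
    "gameDate": [20081005, 20081010, 20081013, 20081015, 20081017],
    "xGoalsFor": [1.0, 2.0, 3.0, 4.0, 5.0],
})
result = add_rolling_means(toy, ["xGoalsFor"], windows=(2,))
print(result[["gameDate", "xGoalsFor", "xGoalsFor_last2"]])
```

The first game of each team has no history, so its rolling value is NaN and would need to be dropped or filled before training.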

#### Team Stats:
- fenwickPercentage
- xGoalsFor
- flurryAdjustedxGoalsFor
- shotsOnGoalFor
- shotAttemptsFor
- penalityMinutesFor
- Fenwick% Against (1 - fenwickPercentage)


#### Goalie Stats (Starter):
- savedShotsOnGoalFor / shotsOnGoalFor (SV%)

#### Skater Stats:
Skater stats are a bit more complicated than Team Stats and Goalie Stats because each team has many more skaters, and not every skater has the same impact. To account for this, I will weight each starting skater's stats by their ice time (minutes/game) and take the weighted team average across all starting skaters.
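A minimal sketch of that weighting, assuming each skater contributes one stat value and one minutes-per-game value. The function name `weighted_team_stat` and the three skaters' numbers are hypothetical.

```python
import numpy as np

def weighted_team_stat(stat_values, minutes_per_game):
    """Average a skater stat across the lineup, weighted by ice time."""
    return float(np.average(stat_values, weights=minutes_per_game))

# Hypothetical numbers for three skaters
x_goals = [0.9, 0.5, 0.2]
minutes = [20.0, 15.0, 5.0]

team_value = weighted_team_stat(x_goals, minutes)
print(round(team_value, 4))  # (0.9*20 + 0.5*15 + 0.2*5) / 40
```

With these weights, the skater playing 20 minutes counts four times as much as the one playing 5.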

In [160]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### Step 1: Clean/Organize the data for the model

In [161]:
df = pd.read_csv('filtered_team_stats2.csv', nrows=20000)
display(df)

Unnamed: 0,team,season,name,gameId,playerTeam,opposingTeam,home_or_away,gameDate,position,situation,...,unblockedShotAttemptsAgainst,scoreAdjustedUnblockedShotAttemptsAgainst,dZoneGiveawaysAgainst,xGoalsFromxReboundsOfShotsAgainst,xGoalsFromActualReboundsOfShotsAgainst,reboundxGoalsAgainst,totalShotCreditAgainst,scoreAdjustedTotalShotCreditAgainst,scoreFlurryAdjustedTotalShotCreditAgainst,playoffGame
0,NYR,2008,NYR,2008020003,NYR,T.B,HOME,20081005,Team Level,all,...,32,31.984,5,0.241,0.000,0.000,1.091,1.117,1.091,0
1,NYR,2008,NYR,2008020010,NYR,CHI,HOME,20081010,Team Level,all,...,45,43.911,3,0.448,0.407,0.407,2.738,2.751,2.730,0
2,NYR,2008,NYR,2008020034,NYR,N.J,HOME,20081013,Team Level,all,...,39,38.019,3,0.383,1.139,1.139,2.698,2.691,2.242,0
3,NYR,2008,NYR,2008020044,NYR,BUF,HOME,20081015,Team Level,all,...,24,24.517,0,0.270,1.380,1.380,1.777,1.805,1.796,0
4,NYR,2008,NYR,2008020057,NYR,TOR,HOME,20081017,Team Level,all,...,33,33.549,2,0.406,0.130,0.130,1.662,1.701,1.697,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,L.A,2020,L.A,2020020584,L.A,S.J,HOME,20210402,Team Level,all,...,52,57.263,1,0.571,0.263,0.263,4.235,4.565,4.508,0
19996,L.A,2020,L.A,2020020596,L.A,S.J,HOME,20210403,Team Level,all,...,35,36.514,2,0.397,0.779,0.779,2.885,3.031,2.999,0
19997,L.A,2020,L.A,2020020610,L.A,ARI,HOME,20210405,Team Level,all,...,47,51.296,2,0.493,2.144,2.144,2.718,2.915,2.791,0
19998,L.A,2020,L.A,2020020624,L.A,ARI,HOME,20210407,Team Level,all,...,38,40.281,2,0.396,1.054,1.054,1.695,1.800,1.783,0


In [162]:
stats_we_want = ['gameId', 
                 'season', 
                 'gameDate', 
                 'team', 
                 'home_or_away',
                 'opposingTeam',
                 'fenwickPercentage', 
                 'xGoalsFor',
                 'flurryAdjustedxGoalsFor',
                 'shotsOnGoalFor',
                 'shotAttemptsFor',
                 'goalsFor',
                 'penalityMinutesFor',
                 
                 'xGoalsAgainst',
                 'flurryAdjustedxGoalsAgainst',
                 'shotsOnGoalAgainst',
                 'shotAttemptsAgainst',
                 'goalsAgainst',
                 'penalityMinutesAgainst'
                ]

filtered_df = (df[stats_we_want]).drop_duplicates(subset=['gameId'])
final_data = filtered_df.rename(columns={'team': 'HomeTeam',
                                            'opposingTeam': 'AwayTeam',
                                            'fenwickPercentage': 'Home_fenwickPercentage',
                                            'xGoalsFor': 'Home_xGoals',
                                            'flurryAdjustedxGoalsFor': 'Home_flurryAdjustedxGoals',
                                            'shotsOnGoalFor': 'Home_shotsOnGoal',
                                            'shotAttemptsFor': 'Home_shotAttempts',
                                            'goalsFor': 'Home_goals',
                                            'penalityMinutesFor': 'Home_penaltyMinutes',
                                            'xGoalsAgainst': 'Away_xGoals',
                                            'flurryAdjustedxGoalsAgainst': 'Away_flurryAdjustedxGoals',
                                            'shotsOnGoalAgainst': 'Away_shotsOnGoal',
                                            'shotAttemptsAgainst': 'Away_shotAttempts',
                                            'goalsAgainst': 'Away_goals',
                                            'penalityMinutesAgainst': 'Away_penaltyMinutes'
                                        })
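# Fenwick% against is the complement of the home team's Fenwick%
final_data['Away_FenwickPercentage'] = 1 - final_data['Home_fenwickPercentage']
# Outcome is 1 when the home team scored more goals, otherwise 0 (games with equal goal totals are also labeled 0)
final_data['Outcome'] = np.where(final_data['Home_goals'] > final_data['Away_goals'], 1, 0)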
display(final_data)

Unnamed: 0,gameId,season,gameDate,HomeTeam,home_or_away,AwayTeam,Home_fenwickPercentage,Home_xGoals,Home_flurryAdjustedxGoals,Home_shotsOnGoal,...,Home_goals,Home_penaltyMinutes,Away_xGoals,Away_flurryAdjustedxGoals,Away_shotsOnGoal,Away_shotAttempts,Away_goals,Away_penaltyMinutes,Away_FenwickPercentage,Outcome
0,2008020003,2008,20081005,NYR,HOME,T.B,0.6190,1.793,1.744,39,...,2,15,0.916,0.894,19,44,1,13,0.3810,1
1,2008020010,2008,20081010,NYR,HOME,CHI,0.4643,1.938,1.927,29,...,4,16,2.762,2.678,32,53,2,16,0.5357,1
2,2008020034,2008,20081013,NYR,HOME,N.J,0.4507,1.562,1.549,24,...,4,10,3.454,2.857,27,58,1,12,0.5493,1
3,2008020044,2008,20081015,NYR,HOME,BUF,0.5714,1.729,1.672,20,...,1,26,2.887,2.840,19,33,3,15,0.4286,0
4,2008020057,2008,20081017,NYR,HOME,TOR,0.6333,4.186,3.995,32,...,0,11,1.386,1.374,21,43,0,21,0.3667,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19995,2020020584,2020,20210402,L.A,HOME,S.J,0.4348,2.287,2.197,30,...,0,4,3.927,3.865,36,67,3,4,0.5652,0
19996,2020020596,2020,20210403,L.A,HOME,S.J,0.5833,2.383,2.237,37,...,2,4,3.267,3.220,23,49,3,4,0.4167,0
19997,2020020610,2020,20210405,L.A,HOME,ARI,0.5104,3.393,3.046,38,...,2,10,4.369,4.184,33,60,5,6,0.4896,0
19998,2020020624,2020,20210407,L.A,HOME,ARI,0.5309,1.958,1.918,30,...,4,4,2.352,2.193,27,49,3,2,0.4691,1


#### Step 2: Create the initial model
We are going to start by implementing a Random Forest Classifier Model using the Scikit-Learn library.

In [163]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score

rf = RandomForestClassifier(n_estimators=70, min_samples_split=12, random_state=1)

#### Step 3: Split up the data into training data and testing data
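Since the goal of this model is to use past games to predict future ones, the intended split is chronological: games before 2018 for training and games from 2018 onward for testing. The cells below do not do this yet; they use a random 80/20 split from `train_test_split`, so the scores reported here come from that random split.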
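For reference, a season-based split could look something like the sketch below. It assumes the `season` column holds the starting year of each season and that the 2018 season itself goes into the test set; `chronological_split` and the toy rows are hypothetical.

```python
import pandas as pd

def chronological_split(data, feature_cols, target_col="Outcome", first_test_season=2018):
    """Train on seasons before the cutoff, test on the cutoff season and later."""
    train = data[data["season"] < first_test_season]
    test = data[data["season"] >= first_test_season]
    return train[feature_cols], test[feature_cols], train[target_col], test[target_col]

# Made-up rows, one per season
toy = pd.DataFrame({
    "season": [2016, 2017, 2018, 2019],
    "Home_xGoals": [1.8, 2.4, 3.1, 2.0],
    "Outcome": [1, 0, 1, 0],
})

x_train, x_test, y_train, y_test = chronological_split(toy, ["Home_xGoals"])
print(len(x_train), len(x_test))
```

Unlike a random split, this keeps every test game later than every training game, which is closer to how the model would be used on upcoming games.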

In [164]:
y = final_data['Outcome'] #This is the target data

columns_to_drop = ['gameId', 'season', 'gameDate', 'HomeTeam', 'home_or_away', 'AwayTeam', 'Outcome', 'Home_goals', 'Away_goals']

x = final_data.drop(columns = columns_to_drop) #This is the input data

print(x.shape)

(20000, 12)


In [165]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)

In [166]:
rf.fit(x_train, y_train)

RandomForestClassifier(min_samples_split=12, n_estimators=70, random_state=1)

In [167]:
predictions = rf.predict(x_test)

In [168]:
acc = accuracy_score(y_test, predictions)
print(acc)

0.70525


A 70% accuracy score isn't bad, but accuracy alone doesn't tell the full story.

Accuracy for different values of `n_estimators`: <br>
50 -> 0.702 <br>
60 -> 0.7045 <br>
70 -> 0.7053 <br>
80 -> 0.705 <br>
100 -> 0.7045
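A comparison across forest sizes like the one above could be produced with a loop along these lines. This is only a sketch: it runs on synthetic data from `make_classification` as a stand-in, whereas the notebook would use `x_train`, `x_test`, `y_train`, and `y_test` built from `final_data`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 12 features, matching the width of x in the notebook
features, labels = make_classification(n_samples=500, n_features=12, random_state=0)
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=101
)

scores = {}
for n_trees in (50, 60, 70, 80, 100):
    # Same hyperparameters as the notebook's model, varying only the number of trees
    model = RandomForestClassifier(n_estimators=n_trees, min_samples_split=12, random_state=1)
    model.fit(f_train, l_train)
    scores[n_trees] = accuracy_score(l_test, model.predict(f_test))

for n_trees, score in scores.items():
    print(f"{n_trees} -> {score:.4f}")
```

On a 4,000-game test set, a gap of 0.003 in accuracy is only 12 games, so differences this small are probably not a strong reason to prefer one setting over another.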

In [169]:
combined = pd.DataFrame(dict(actual=y_test, predictions=predictions))

pd.crosstab(index=combined['actual'], columns=combined['predictions'])

actual \ predictions,0,1
0,1464,572
1,607,1357


In [170]:
precision_score(y_test, predictions)

0.7034733022291343
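The accuracy and precision scores can be recomputed by hand from the crosstab counts, which also gives recall and a simple baseline for comparison. The variable names below are my own; the four counts are the ones printed in the crosstab above.

```python
# Counts from the crosstab (rows: actual, columns: predicted)
true_neg, false_pos = 1464, 572   # actual 0
false_neg, true_pos = 607, 1357   # actual 1

total = true_neg + false_pos + false_neg + true_pos
accuracy = (true_pos + true_neg) / total
precision = true_pos / (true_pos + false_pos)   # of predicted home wins, how many were right
recall = true_pos / (true_pos + false_neg)      # of actual home wins, how many were found

# Baseline: always predict whichever class is more common in the test set
baseline = max(true_neg + false_pos, false_neg + true_pos) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} baseline={baseline:.4f}")
```

The baseline gives a sense of how much of the 70% comes from the model rather than from class balance alone.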

In [171]:
probs = rf.predict_proba(x_test)
print(probs)

[[0.45093707 0.54906293]
 [0.79494525 0.20505475]
 [0.67954155 0.32045845]
 ...
 [0.39329671 0.60670329]
 [0.45899182 0.54100818]
 [0.50472404 0.49527596]]
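One way these probabilities might be used is to separate confident predictions from near coin flips. The sketch below uses only the six rows printed above; the 0.60 cutoff is an arbitrary value chosen for illustration. Column 0 is the probability of `Outcome` 0 and column 1 the probability of `Outcome` 1, following the order of `rf.classes_`.

```python
import numpy as np

# The six rows shown in the printed output
probs = np.array([
    [0.45093707, 0.54906293],
    [0.79494525, 0.20505475],
    [0.67954155, 0.32045845],
    [0.39329671, 0.60670329],
    [0.45899182, 0.54100818],
    [0.50472404, 0.49527596],
])

threshold = 0.60  # hypothetical cutoff, not tuned

picks = probs.argmax(axis=1)                 # predicted class per game
confident = probs.max(axis=1) >= threshold   # True where the model leans clearly one way

for pick, prob, keep in zip(picks, probs.max(axis=1), confident):
    label = "confident" if keep else "toss-up"
    print(f"predicted={pick} prob={prob:.3f} {label}")
```

Measuring accuracy separately on the confident subset would show whether higher probabilities actually correspond to more reliable picks.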
