# Modelling the best predictors of NHL player salary (XGBoost)

In my last post I did an initial cleaning of this dataset and built a simple random forest regression model to examine the best predictors of salary. Here I build on that initial model, using XGBoost to predict NHL players' salaries. I also compare the XGBoost and random forest models, looking at their accuracy (via root mean squared error on the test data) and at the best predictors in the two models.

## 1. Data munging

This section replicates the data wrangling from my previous analyses, to get the data into a clean format that we can use for predictive modelling.

In [1]:
library('plyr')
library('stringr')
library('tidyverse')
library('magrittr')
library('scatterplot3d')
library('dummies')
library('randomForest')
library('xgboost')



Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
arrange():   dplyr, plyr
compact():   purrr, plyr
count():     dplyr, plyr
failwith():  dplyr, plyr
filter():    dplyr, stats
id():        dplyr, plyr
lag():       dplyr, stats
mutate():    dplyr, plyr
rename():    dplyr, plyr
summarise(): dplyr, plyr
summarize(): dplyr, plyr

Attaching package: ‘magrittr’

The following object is masked from ‘package:purrr’:

    set_names

The following object is masked from ‘package:tidyr’:

    extract



ERROR: Error in library("dummies"): there is no package called ‘dummies’


In [12]:
train = read.csv('train.csv')

test_x = read.csv('test.csv')

test_y = read.csv('test_salaries.csv')

#data cleaning
#impute missing data and fix problem categorical columns
#to do this we merge the train and test data into a single set

#add train/test column
test_x$TrainTest = "test"
train$TrainTest =  "train"

test = cbind(test_y, test_x)
all_data = rbind(train,test)

# make new column for undrafted
all_data$undrafted = is.na(all_data$DftRd)

#fill the Pr.St column with 'INT' for international players
all_data$Pr.St = mapvalues(all_data$Pr.St, from = "", to="INT")

#Make team boolean columns
#get the unique list of team acronyms
teams = c()
for( i in levels(all_data$Team)){
	x = strsplit(i, "/")
	for(y in x){
		teams = c(teams, y)
	}
}
teams = unique(teams)

# add columns with the team names as the header and 0 as values
for(team in teams){
	all_data[,team] = 0
}

#iterate through and record the teams for each player
for(i in 1:length(all_data$Team)){
	teams_of_person = strsplit(as.character(all_data$Team[i]), "/")[[1]]
	for(x in teams_of_person){
		all_data[,x][i] = 1	
	}
}

#Make position boolean columns
pos = c()
for( i in levels(all_data$Position)){
	x = strsplit(i, "/")
	for(y in x){
		pos = c(pos, y)
	}
}
pos = unique(pos)

# add columns with the pos names as the header and 0 as values
for(position in pos){
	all_data[,position] = 0
}

#iterate through and record the position(s) for each player
for(i in 1:length(all_data$Position)){
	pos_of_person = strsplit(as.character(all_data$Position[i]), "/")[[1]]
	for(x in pos_of_person){
		all_data[,x][i] = 1	
	}
}



#split the Born column into
# 3 integer columns: year, month, day
# (two-digit years are expanded to four digits below)

bday_parts = str_split_fixed(all_data$Born, "-",3)

#adjust year column to account for missing digits
birth_year = c()
for(year in bday_parts[,1]){
	if(as.numeric(year) < 10){
		yr = paste("20", year, sep="")
		birth_year = c(birth_year, yr)
	}else{
		yr = paste("19",year, sep="")
		birth_year = c(birth_year, yr)
	}
}

all_data$birth_year = as.numeric(birth_year)
all_data$birth_month = as.numeric(bday_parts[,2])
all_data$birth_day = as.numeric(bday_parts[,3])



#split Cntry and Nat to boolean columns

birth_country = levels(all_data$Cntry)
# add columns with the country of birth options
# note the Estonia for Uncle Leo
for(country in birth_country){
	c = paste("born", country, sep="_")

	all_data[,c] = 0
}

#iterate through and record the birth country of each player
for(i in 1:length(all_data$Cntry)){
	birth_country = all_data$Cntry[i]
	c = paste("born", birth_country, sep="_")
	all_data[,c][i] = 1	
}


nationality = levels(all_data$Nat)
for(country in nationality){
	c = paste("nation", country, sep="_")
	all_data[,c] = 0
}

#iterate through and record the nationality of each player
for(i in 1:length(all_data$Nat)){
	nationality = all_data$Nat[i]
	c = paste("nation", nationality, sep="_")
	all_data[,c][i] = 1	
}


# impute missing values in numerical columns with the column median

#list the columns that still have missing values
all_missing_list =  colnames(all_data)[colSums(is.na(all_data)) > 0]
length(all_missing_list) == 0
#if the above is TRUE there is nothing left to impute

#loop through those columns, filling the gaps in each with the median of
#the existing values for the entire dataset (train and test combined)
for( i in seq_along(all_missing_list)){
	#get the global median
	median_all = median(all_data[,all_missing_list[i]], na.rm =TRUE)
	#impute the missing values with the column's median
	all_data[,all_missing_list[i]][is.na(all_data[,all_missing_list[i]])] = median_all
}

#make a df copy so we can graph with names at the end.
graph_all_data = all_data
all_data = all_data[, !(colnames(all_data) %in% c("Last.Name","First.Name","Cntry","Nat","Born","Team","City","Position"))]
head(all_data)

train_dat = all_data[all_data$TrainTest == "train",]

test_dat = all_data[all_data$TrainTest == "test",]


#drop the train/test split columns
train_dat = train_dat[, !(colnames(train_dat) %in% c("TrainTest"))]
test_dat = test_dat[, !(colnames(test_dat) %in% c("TrainTest"))]


y_column = c("Salary")
all_columns = names(train_dat)
predictor_columns = all_columns[all_columns != y_column]


#Additional XGBoost Cleaning
#need to make these into dummy variables before passing into xgb.DMatrix

train_in = select(train_dat,one_of(predictor_columns))
test_in = select(test_dat,one_of(predictor_columns))
head(train_in)
head(test_in)

Salary,Pr.St,Ht,Wt,DftYr,DftRd,Ovrl,Hand,GP,G,⋯,nation_FRA,nation_GBR,nation_HRV,nation_LVA,nation_NOR,nation_RUS,nation_SVK,nation_SWE,nation_USA,nation_SVN
925000,QC,74,190,2015,1,18,L,1,0,⋯,0,0,0,0,0,0,0,0,0,0
2250000,ON,74,207,2012,1,15,R,79,2,⋯,0,0,0,0,0,0,0,0,0,0
8000000,MN,72,218,2006,1,7,R,65,19,⋯,0,0,0,0,0,0,0,0,1,0
3500000,ON,77,220,2010,1,3,R,30,1,⋯,0,0,0,0,0,0,0,0,0,0
1750000,ON,76,217,2012,1,16,R,82,7,⋯,0,0,0,0,0,0,0,0,0,0
1500000,ON,70,192,1997,6,156,L,80,5,⋯,0,0,0,0,0,0,0,0,0,0


Pr.St,Ht,Wt,DftYr,DftRd,Ovrl,Hand,GP,G,A,⋯,nation_FRA,nation_GBR,nation_HRV,nation_LVA,nation_NOR,nation_RUS,nation_SVK,nation_SWE,nation_USA,nation_SVN
QC,74,190,2015,1,18,L,1,0,0,⋯,0,0,0,0,0,0,0,0,0,0
ON,74,207,2012,1,15,R,79,2,15,⋯,0,0,0,0,0,0,0,0,0,0
MN,72,218,2006,1,7,R,65,19,26,⋯,0,0,0,0,0,0,0,0,1,0
ON,77,220,2010,1,3,R,30,1,5,⋯,0,0,0,0,0,0,0,0,0,0
ON,76,217,2012,1,16,R,82,7,12,⋯,0,0,0,0,0,0,0,0,0,0
ON,70,192,1997,6,156,L,80,5,12,⋯,0,0,0,0,0,0,0,0,0,0


Unnamed: 0,Pr.St,Ht,Wt,DftYr,DftRd,Ovrl,Hand,GP,G,A,⋯,nation_FRA,nation_GBR,nation_HRV,nation_LVA,nation_NOR,nation_RUS,nation_SVK,nation_SWE,nation_USA,nation_SVN
613,NY,72,216,2003,1,13,R,80,14,22,⋯,0,0,0,0,0,0,0,0,1,0
614,INT,72,195,2014,1,13,L,21,3,3,⋯,0,0,0,0,0,0,0,0,0,0
615,MO,75,227,2007,6,161,L,81,27,15,⋯,0,0,0,0,0,0,0,0,1,0
616,INT,72,182,2013,2,55,L,73,18,10,⋯,0,0,0,0,0,0,0,0,0,0
617,NY,72,196,2011,2,36,R,31,2,9,⋯,0,0,0,0,0,0,0,0,1,0
618,MN,74,210,2002,4,129,R,18,1,4,⋯,0,0,0,0,0,0,0,0,1,0
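
The team and position loops in the cell above turn a slash-separated column such as `Team` into one 0/1 column per value. As a rough sketch of the same idea without the row-by-row loop, here is a base R version on a made-up three-row data frame (the `toy` frame and its values are illustrative assumptions, not rows from the dataset):

```r
# Toy stand-in for all_data$Team; the values are invented for illustration
toy <- data.frame(Team = c("TOR", "MTL/BOS", "BOS"), stringsAsFactors = FALSE)

# Split each entry on "/" to get the list of teams per player
team_lists <- strsplit(as.character(toy$Team), "/", fixed = TRUE)

# Unique team acronyms across all players
teams <- unique(unlist(team_lists))

# One row per player, one 0/1 column per team
indicators <- do.call(rbind, lapply(team_lists, function(tm) as.numeric(teams %in% tm)))
colnames(indicators) <- teams

toy <- cbind(toy, indicators)
print(toy)
```

The same pattern would apply to the `Position` column; the loop version in the cell above gives the same kind of result and is kept as the post's actual method.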


## 2. The XGBoost Model 

### Additional cleaning of data for XGBoost

The randomForest package in R accepts categorical (factor) columns in a dataframe and handles them for us. This is not the case with XGBoost, which must be passed a numeric matrix, so we need to turn the remaining categorical columns into numeric dummy variables.

In [None]:
train_in = select(train_dat,one_of(predictor_columns))
test_in = select(test_dat,one_of(predictor_columns))

names(train_in)[4:length(names(train_in))] #inspect the predictor names
head(train_in)

#change undrafted to 0 and 1
train_in$undrafted = as.numeric(train_in$undrafted)
test_in$undrafted = as.numeric(test_in$undrafted)
#change the hand to two booleans
train_in = cbind(train_in ,dummy(train_in$Hand))
test_in = cbind(test_in ,dummy(test_in$Hand))

# Pr.St: check that the levels are the same in train and test
all(levels(train_in$Pr.St) == levels(test_in$Pr.St))
#if TRUE, the dummy columns will line up across the two sets
train_in = cbind(train_in ,dummy(train_in$Pr.St))
test_in = cbind(test_in ,dummy(test_in$Pr.St))

#drop the pre dummies
train_in = train_in[, !(colnames(train_in) %in% c("Hand","Pr.St"))]
test_in = test_in[, !(colnames(test_in) %in% c("Hand","Pr.St"))]
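
The cell above relies on `dummy()` from the `dummies` package, which failed to load in the first cell. As a hedged alternative, the sketch below builds the same kind of 0/1 columns with base R's `model.matrix`. The `toy_train` and `toy_test` frames and the `one_hot` helper are hypothetical names made up for this example, and the values are invented:

```r
# Toy stand-ins for train_in / test_in (values are made up)
toy_train <- data.frame(Hand = c("L", "R", "R"), Pr.St = c("ON", "QC", "INT"),
                        Wt = c(190, 207, 218), stringsAsFactors = FALSE)
toy_test  <- data.frame(Hand = c("R", "L"), Pr.St = c("ON", "MN"),
                        Wt = c(220, 192), stringsAsFactors = FALSE)

# Give both frames the same factor levels so the dummy columns line up
for (col in c("Hand", "Pr.St")) {
  lv <- sort(unique(c(toy_train[[col]], toy_test[[col]])))
  toy_train[[col]] <- factor(toy_train[[col]], levels = lv)
  toy_test[[col]]  <- factor(toy_test[[col]],  levels = lv)
}

# Hypothetical helper: full one-hot encoding (one column per factor level)
one_hot <- function(df) {
  factor_cols <- df[sapply(df, is.factor)]
  model.matrix(~ . - 1, data = df,
               contrasts.arg = lapply(factor_cols, contrasts, contrasts = FALSE))
}

train_mat <- one_hot(toy_train)
test_mat  <- one_hot(toy_test)
print(colnames(train_mat))
```

Because the levels are shared before encoding, a level that appears in only one of the two sets still gets a column (all zeros) in the other, so both matrices have the same columns in the same order.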


### Pass data to XGBoost
Now there are no strings to worry about, and we can load the numeric matrix using the xgb.DMatrix function.

In [None]:
dtrain = xgb.DMatrix(data =  as.matrix(train_in), label = train_dat[,y_column])
dtest = xgb.DMatrix(data =  as.matrix(test_in), label = test_dat[,y_column])

### Train the model
Next we train the XGBoost model. Note that I am using a basic set of parameters here; one could tweak things such as the learning rate or the number of rounds to optimize the model.

In [None]:
watchlist = list(train=dtrain, test=dtest)
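#note: "reg:linear" is the older name for the "reg:squarederror" objective in newer xgboost releases
bst = xgb.train(data=dtrain, max.depth=8, eta=0.3, nthread = 2, nrounds=1000, watchlist=watchlist, objective = "reg:linear", early_stopping_rounds = 50)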
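
One way to do the tweaking mentioned above is cross-validation on the training data, which also keeps the test set out of the tuning loop. The sketch below is only an illustration under stated assumptions: it uses randomly generated toy data rather than the salary data, the two `eta` values and `max_depth = 4` are arbitrary choices, and it uses the newer `"reg:squarederror"` objective name.

```r
library(xgboost)

set.seed(42)
# Toy regression data standing in for train_in / Salary (purely illustrative)
n <- 200
toy_x <- matrix(rnorm(n * 5), ncol = 5, dimnames = list(NULL, paste0("f", 1:5)))
toy_y <- 3 * toy_x[, 1] - 2 * toy_x[, 2] + rnorm(n)
toy_dtrain <- xgb.DMatrix(data = toy_x, label = toy_y)

# Candidate learning rates to compare (arbitrary choices)
etas <- c(0.05, 0.3)

cv_results <- lapply(etas, function(eta) {
  cv <- xgb.cv(params = list(objective = "reg:squarederror", eta = eta, max_depth = 4),
               data = toy_dtrain, nrounds = 100, nfold = 5, verbose = 0)
  log <- as.data.frame(cv$evaluation_log)
  # Round with the lowest mean held-out RMSE across the folds
  best <- which.min(log$test_rmse_mean)
  data.frame(eta = eta, best_round = best, cv_rmse = log$test_rmse_mean[best])
})

cv_summary <- do.call(rbind, cv_results)
print(cv_summary)
```

The row with the lowest `cv_rmse` gives a candidate learning rate and number of rounds, which could then be passed to `xgb.train` on the full training set.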

### Best predictors
With the model trained, which are the best predictors? Here we pass the column names from the input dataframe and the bst model object into the xgb.importance function, which pairs the names with the columns to make the output more interpretable.

In [None]:
bst #look at the model
XGBoost_importance = xgb.importance(feature_names = names(train_in), model = bst)
XGBoost_importance #get the feature importance list

In [None]:

color.gradient <- function(x, colors=c("green", "yellow", "red"), colsteps=100) {
  return( colorRampPalette(colors) (colsteps) [ findInterval(x, seq(min(x),max(x), length.out=colsteps)) ] )
}

sd3 = scatterplot3d(graph_all_data$xGF, graph_all_data$DftYr, graph_all_data$Salary, # x, y and z
                    pch=19,
                    type="h",
                    cex.axis=0.5,
                    las=1,
                    lty.hplot=2,
                    color=color.gradient(graph_all_data$Salary, c("black","salmon")),
                    main="Interaction of xGF, draft year and salary",
                    zlab="Salary",
                    xlab="xGF",
                    ylab="Draft Year",
                    grid=TRUE)
	
sd3.coords = sd3$xyz.convert(graph_all_data$xGF, graph_all_data$DftYr,  graph_all_data$Salary) # convert 3D coords to 2D projection
text(sd3.coords$x, sd3.coords$y,labels=graph_all_data$Last.Name,cex=.5, pos=4)  



## 3. Differences between the random forest regression and XGBoost models

When we ran the random forest regression in the last kernel, the test root mean squared error (RMSE) was 1578497. The XGBoost model here had a slightly lower test RMSE of 1574073, a difference of 4424 (roughly 0.3%). That suggests XGBoost made marginally better predictions on the test data, although with a gap this small the two models performed about the same.
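
For reference, the RMSE is the square root of the mean squared difference between actual and predicted salaries. A minimal sketch of that calculation is below; the `rmse` helper is a hypothetical name and the salary vectors are made-up numbers, not model output:

```r
# Hypothetical helper: root mean squared error between two numeric vectors
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up salaries and predictions, for illustration only
actual    <- c(925000, 2250000, 8000000)
predicted <- c(1000000, 2000000, 7500000)

print(rmse(actual, predicted))

# With the objects from this post, the predictions would come from something like
# predict(bst, dtest), compared against test_dat$Salary
```

Because the errors are squared before averaging, a few badly missed high salaries can dominate the score.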

### Top predictors of salary


For the XGBoost model, the top 10 predictors were:

    1. xGF 
    2. DftYr 
    3. SF 
    4. Ovrl 
    5. FOL 
    6. FF 
    7. TOI.GP 
    8. iCF 
    9. GS.G
    10. RSA
    
For the random forest model from the last kernel, the top 10 predictors were:

    1. DftYr
    2. birth_year
    3. TOI.GP.1
    4. TOI.GP
    5. TOI.
    6. xGF
    7. SF
    8. FF
    9. GF
    10. Ovrl

The number of 'advanced stats' on these lists really surprises me, but it suggests these stats are general enough to provide a good assessment of a player's skill across positions and play styles (if we assume that salary is a reasonable assessment of skill level when comparing players of the same age). For both goal-scoring forwards and stay-at-home defencemen, the number of scoring chances generated while that player is on the ice is a good measure of a player's effectiveness, and therefore a good predictor of salary once the obvious factors, such as age and the amount of playing time a player gets, are taken into account.