## Table Of Contents:


### [Import Libraries](#Imports)

### [Read in the previously collected data](#Read-in-the-data)

### [Prepping the data using transformer pipelines](#Data-prep)


### Imports

In [299]:
# Data Analysis
import numpy as np
import pandas as pd
import datetime

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Machine Learning
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LinearRegression

# Filter warnings
import warnings
warnings.filterwarnings('ignore')

### Read in the data

Here I will read in the train and test sets that I exported after completing the EDA.

In [291]:
# Import train and test sets
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')



In [292]:
# Check train set
df_train.head(3)

Unnamed: 0,season,team,name,birthday,age,nationality,height,weight,number,rookie,...,goals,pim,shots,shot_perc,games,hits,blocked,plusminus,shifts,points
0,20082009,New Jersey Devils,Travis Zajac,1985-05-13,35,CAN,"6' 2""",185,19,False,...,20,29,185,10.81,82,59,40,33,1895,62
1,20082009,New York Islanders,Johnny Boychuk,1984-01-19,36,CAN,"6' 2""",227,55,False,...,0,0,0,0.0,1,3,1,0,20,0
2,20082009,New York Islanders,Andrew Ladd,1985-12-12,34,CAN,"6' 3""",192,16,False,...,15,28,195,7.69,82,117,22,26,1803,49


In [293]:
# Check test set
df_test.head(3)

Unnamed: 0,season,team,name,birthday,age,nationality,height,weight,number,rookie,...,goals,pim,shots,shot_perc,games,hits,blocked,plusminus,shifts,points
0,20182019,New Jersey Devils,Travis Zajac,1985-05-13,35,CAN,"6' 2""",185,19,False,...,19,20,120,15.83,80,66,38,-25,1818,46
1,20182019,New Jersey Devils,P.K. Subban,1989-05-13,31,CAN,"6' 0""",210,76,False,...,9,60,168,5.36,63,56,75,5,1731,31
2,20182019,New Jersey Devils,Kyle Palmieri,1991-02-01,29,USA,"5' 11""",185,21,False,...,27,42,224,12.05,74,98,35,-9,1580,50


In [294]:
# Split X and Y variables
train_labels = df_train['points']
df_train = df_train.drop(['goals', 'assists', 'points'], axis=1)
df_train



Unnamed: 0,season,team,name,birthday,age,nationality,height,weight,number,rookie,...,sh_toi,ev_toi,pim,shots,shot_perc,games,hits,blocked,plusminus,shifts
0,20082009,New Jersey Devils,Travis Zajac,1985-05-13,35,CAN,"6' 2""",185,19,False,...,164:13,1096:38,29,185,10.81,82,59,40,33,1895
1,20082009,New York Islanders,Johnny Boychuk,1984-01-19,36,CAN,"6' 2""",227,55,False,...,00:34,13:26,0,0,0.00,1,3,1,0,20
2,20082009,New York Islanders,Andrew Ladd,1985-12-12,34,CAN,"6' 3""",192,16,False,...,78:38,1071:35,28,195,7.69,82,117,22,26,1803
3,20082009,New York Islanders,Andy Greene,1982-10-30,37,USA,"5' 11""",190,4,False,...,09:04,704:28,22,38,5.26,49,37,63,3,993
4,20082009,New York Islanders,Cal Clutterbuck,1987-11-18,32,CAN,"5' 11""",216,15,False,...,60:36,922:20,76,136,8.09,78,356,39,-5,1603
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2928,20172018,Vegas Golden Knights,William Carrier,1994-12-20,25,CAN,"6' 2""",218,28,False,...,00:39,322:38,19,52,1.92,37,113,10,-3,455
2929,20172018,Vegas Golden Knights,Tomas Nosek,1992-09-01,28,CZE,"6' 2""",205,92,False,...,101:30,634:22,14,92,7.61,67,53,24,6,1004
2930,20172018,Vegas Golden Knights,Alex Tuch,1996-05-10,24,USA,"6' 4""",220,89,False,...,03:30,1004:16,27,171,8.77,78,99,42,3,1366
2931,20172018,Vegas Golden Knights,Nicolas Roy,1997-02-05,23,CAN,"6' 4""",200,10,True,...,00:00,10:46,0,1,0.00,1,2,0,-1,16


In [295]:
# Split X and Y variables for testing
test_labels = df_test['points']
df_test = df_test.drop(['points', 'goals', 'assists'], axis=1)
df_test

Unnamed: 0,season,team,name,birthday,age,nationality,height,weight,number,rookie,...,sh_toi,ev_toi,pim,shots,shot_perc,games,hits,blocked,plusminus,shifts
0,20182019,New Jersey Devils,Travis Zajac,1985-05-13,35,CAN,"6' 2""",185,19,False,...,217:29,1112:40,20,120,15.83,80,66,38,-25,1818
1,20182019,New Jersey Devils,P.K. Subban,1989-05-13,31,CAN,"6' 0""",210,76,False,...,98:27,1166:17,60,168,5.36,63,56,75,5,1731
2,20182019,New Jersey Devils,Kyle Palmieri,1991-02-01,29,USA,"5' 11""",185,21,False,...,21:46,1077:44,42,224,12.05,74,98,35,-9,1580
3,20182019,New Jersey Devils,Fredrik Claesson,1992-11-24,27,SWE,"6' 1""",196,33,False,...,58:34,579:26,9,44,4.55,37,76,33,3,837
4,20182019,New Jersey Devils,Damon Severson,1994-08-07,26,CAN,"6' 2""",205,28,False,...,122:37,1483:03,58,146,7.53,82,83,87,-27,2155
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
674,20182019,Vegas Golden Knights,Shea Theodore,1995-08-03,25,CAN,"6' 2""",195,27,False,...,13:21,1383:22,20,202,5.94,79,27,91,-4,1938
675,20182019,Vegas Golden Knights,William Carrier,1994-12-20,25,CAN,"6' 2""",218,28,False,...,00:06,533:29,29,85,9.41,54,277,13,-4,700
676,20182019,Vegas Golden Knights,Tomas Nosek,1992-09-01,28,CZE,"6' 2""",205,92,False,...,116:08,704:55,18,116,6.90,68,75,23,-10,1055
677,20182019,Vegas Golden Knights,Alex Tuch,1996-05-10,24,USA,"6' 4""",220,89,False,...,00:27,1052:49,8,180,11.11,74,92,40,13,1344


### Data prep

In this section I will build pipelines that take the raw data and automatically prepare it for a machine learning model. I'll do this with the help of [this tutorial](https://towardsdatascience.com/custom-transformers-and-ml-data-pipelines-with-python-20ea2a7adb65).

The transformations that I had done in the initial cleaning steps (and that I'll need to work into the below pipelines) are:

<del>    1) Create a feature that returns the player's age at the start of a season </del>
    
<del>    2) Transform the height feature into inches </del>
    
<del>    3) Clean up the time on ice metrics to represent seconds instead of MM:SS </del>
    
<del>    4) Create a feature: Presence of letter (Yes/No) </del>
    
<del>    5) Create a feature: Division </del>
    
<del>    6) Create a feature: Conference </del>
    
<del>    7) Create a feature: Birth Month </del>
    
<del>    8) Create a feature: Birth Season </del>
    
<del>    9) Impute any missing numeric values with the median for that column </del>
    
All of the explanations of the code and what each line does for these transformations are in the Cleaning & EDA workbook. I tried to put the comments in here as well, but it ended up being too messy.
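
As a rough, standalone illustration of the first three transformations in the list above, the sketch below converts a height string, a time on ice string, and a birthday into numbers. The function names are hypothetical and are not used elsewhere in this notebook; the 1 October season start and the division by 365 are assumptions that mirror what `NumericTransformer` does further down.

```python
from datetime import date


def height_to_inches(height):
    """Convert a height string such as 6' 2" into total inches."""
    feet, inches = height.replace('"', '').split("'")
    return int(feet) * 12 + int(inches)


def toi_to_seconds(toi):
    """Convert a MM:SS time on ice string such as 164:13 into seconds."""
    minutes, seconds = toi.split(':')
    return int(minutes) * 60 + int(seconds)


def age_at_season_start(birthday, season):
    """Approximate age in whole years on 1 October of the season's first year."""
    season_start = date(int(str(season)[:4]), 10, 1)
    born = date.fromisoformat(birthday)
    return round((season_start - born).days / 365)


# Values taken from the first row of the train set
print(height_to_inches("6' 2\""))
print(toi_to_seconds('164:13'))
print(age_at_season_start('1985-05-13', 20082009))
```

Plain functions like these are easy to test in isolation before they are wrapped in a transformer.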

In [300]:
# Custom Transformer that extracts the columns passed as an argument to its constructor
class FeatureSelector(BaseEstimator, TransformerMixin):
    
    # Initialise function; the attribute keeps the parameter name so that
    # BaseEstimator.get_params() and sklearn.base.clone() can find it
    def __init__(self, feature_names):
        self.feature_names = feature_names
    
    # Nothing to learn, so only return self
    def fit(self, X, y=None):
        return self
    
    # Return only the selected columns
    def transform(self, X, y=None):
        return X[self.feature_names]

    
# Custom Transformer to create and deal with categorical features in the dataset
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    
    # Initialise function with parameters and data dictionaries for calculations.
    # team_info and birth_seasons must already be defined when this cell runs.
    def __init__(self, conference=True, division=True, birth_month=True, birth_season=True, letter=True, 
                 team_info=team_info, birth_seasons=birth_seasons):
        self.conference = conference
        self.division = division
        self.birth_month = birth_month
        self.birth_season = birth_season
        self.letter = letter
        self.team_info = team_info
        self.birth_seasons = birth_seasons
     
    # Do nothing, only return self
    def fit(self, X, y=None):
        return self
    
    # Custom transformations
    def transform(self, X, y=None):
        
        # Work on a copy so the caller's dataframe is left untouched
        X = X.copy()
        
        # Conference
        if self.conference:
            X['conference'] = X['team'].apply(lambda x: self.team_info[x]['Conference'])
        
        # Division
        if self.division:
            X['division'] = X['team'].apply(lambda x: self.team_info[x]['Division'])
            
        # Birth Month & Season (the month abbreviation is needed for both)
        if self.birth_month or self.birth_season:
            month = pd.to_datetime(X['birthday']).dt.strftime('%b')
            if self.birth_month:
                X['birth_month'] = month
            if self.birth_season:
                X['birth_season'] = month.map(self.birth_seasons)
        
        # Letter: 'Yes' if the player is a captain or an alternate captain
        if self.letter:
            has_letter = X['captain'].eq(True) | X['alternate_capt'].eq(True)
            X['letter'] = np.where(has_letter, 'Yes', 'No')
                      
        # Return transformed dataframe
        return X
    
     
# Custom Transformer to create and deal with numeric features in the dataset
class NumericTransformer(BaseEstimator, TransformerMixin):
    
    # Initialise function with parameters
    def __init__(self, age=True, height=True, total_toi=True, ev_toi=True, pp_toi=True, sh_toi=True):
        self.age = age
        self.height = height
        self.total_toi = total_toi
        self.ev_toi = ev_toi
        self.pp_toi = pp_toi
        self.sh_toi = sh_toi

    # Do nothing, only return self
    def fit(self, X, y=None):
        return self
    
    # Convert a time on ice string in MM:SS format to seconds
    @staticmethod
    def _to_seconds(toi):
        minutes, seconds = toi.split(':')
        return int(minutes) * 60 + int(seconds)
    
    # Custom transformations
    def transform(self, X, y=None):
        
        # Work on a copy so the caller's dataframe is left untouched
        X = X.copy()
        
        # Age in whole years at the start of the season (taken as 1 October)
        if self.age:
            X['season_start'] = pd.to_datetime(X['season'].astype(str).str[:4] + '-10-01')
            X['birthday'] = pd.to_datetime(X['birthday'].astype(str))
            X['age_season_start'] = ((X['season_start'] - X['birthday']).dt.days / 365).round().astype(int)

        # Height in inches, parsed from strings such as 6' 2"
        if self.height:
            X['inches'] = X['height'].apply(lambda x: (int(x[0])*12) + int(x.split(' ')[1].replace('"','')))

        # Time on ice columns in seconds; each flag switches one column on or off
        toi_columns = {'toi': self.total_toi, 'ev_toi': self.ev_toi,
                       'pp_toi': self.pp_toi, 'sh_toi': self.sh_toi}
        for column, enabled in toi_columns.items():
            if enabled:
                X[f'{column}_secs'] = X[column].apply(self._to_seconds)

        # Return transformed dataframe
        return X
    

In [297]:
# Data dictionary to hold conference and division for each team
team_info = {'New Jersey Devils':{'Conference':'Eastern','Division':'Metropolitan'},
             'New York Islanders':{'Conference':'Eastern', 'Division':'Metropolitan'},
             'New York Rangers':{'Conference':'Eastern', 'Division':'Metropolitan'},
             'Philadelphia Flyers':{'Conference':'Eastern', 'Division':'Metropolitan'}, 
             'Pittsburgh Penguins':{'Conference':'Eastern', 'Division':'Metropolitan'},
             'Boston Bruins':{'Conference':'Eastern', 'Division':'Atlantic'},
             'Buffalo Sabres':{'Conference':'Eastern', 'Division':'Atlantic'},
             'Montréal Canadiens':{'Conference':'Eastern', 'Division':'Atlantic'}, 
             'Ottawa Senators':{'Conference':'Eastern', 'Division':'Atlantic'},
             'Toronto Maple Leafs':{'Conference':'Eastern', 'Division':'Atlantic'},
             'Carolina Hurricanes':{'Conference':'Eastern', 'Division':'Metropolitan'},
             'Florida Panthers':{'Conference':'Eastern', 'Division':'Atlantic'},
             'Tampa Bay Lightning':{'Conference':'Eastern', 'Division':'Atlantic'},
             'Washington Capitals':{'Conference':'Eastern', 'Division':'Metropolitan'},
             'Chicago Blackhawks':{'Conference':'Western', 'Division':'Central'},
             'Detroit Red Wings':{'Conference':'Eastern', 'Division':'Atlantic'},
             'Nashville Predators':{'Conference':'Western', 'Division':'Central'},
             'St. Louis Blues':{'Conference':'Western', 'Division':'Central'},
             'Calgary Flames':{'Conference':'Western', 'Division':'Pacific'},
             'Edmonton Oilers':{'Conference':'Western', 'Division':'Pacific'},
             'Vancouver Canucks':{'Conference':'Western', 'Division':'Pacific'},
             'Anaheim Ducks':{'Conference':'Western', 'Division':'Pacific'},
             'Dallas Stars':{'Conference':'Western', 'Division':'Central'},
             'Los Angeles Kings':{'Conference':'Western', 'Division':'Pacific'},
             'San Jose Sharks':{'Conference':'Western', 'Division':'Pacific'},
             'Columbus Blue Jackets':{'Conference':'Eastern', 'Division':'Metropolitan'},
             'Minnesota Wild':{'Conference':'Western', 'Division':'Central'},
             'Winnipeg Jets':{'Conference':'Western', 'Division':'Central'},
             'Arizona Coyotes':{'Conference':'Western', 'Division':'Pacific'},
             'Vegas Golden Knights':{'Conference':'Western', 'Division':'Pacific'},
             'Colorado Avalanche':{'Conference':'Western', 'Division':'Central'}}


# Data dictionary to map each birth month abbreviation to a season
birth_seasons = {'Jan':'Winter', 'Feb':'Winter', 'Mar':'Spring', 'Apr':'Spring',
                 'May':'Spring', 'Jun':'Summer', 'Jul':'Summer', 'Aug':'Summer',
                 'Sep':'Fall', 'Oct':'Fall', 'Nov':'Fall', 'Dec':'Winter'}


In [260]:
categorical_features = ['team', 'birthday', 'nationality', 'rookie', 'position_code',
                       'position_type', 'handedness']
numeric_features = ['weight', 'pim', 'shots', 'shot_perc', 'games', 'hits', 'blocked', 'plusminus', 'shifts']


categorical_pipe = Pipeline(steps= [
    ('cat_transformer', CategoricalTransformer())#,
    #('one_hot_encode', OneHotEncoder(sparse=False))
    ])

numerical_pipe = Pipeline(steps = [
    ('num_transformer', NumericTransformer()),
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler())   
])

full_pipeline = ColumnTransformer([
    ('cat_pipeline', categorical_pipe, categorical_features),
    ('num_pipeline', numerical_pipe, numeric_features)                                              
])

In [261]:
train_prepared = full_pipeline.fit_transform(df_train)

In [262]:
train_prepared.shape

(2933, 16)
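
To make the column routing concrete, here is a minimal sketch of a `ColumnTransformer` on a toy frame. It is not the notebook's pipeline: the `toy` dataframe and its values are made up for illustration, the names `toy_pipeline` and `numeric_pipe` are hypothetical, and the one-hot encoding step is an assumption about how the commented-out encoder might eventually be used. `sparse_threshold=0` is set so that the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    'nationality': ['CAN', 'USA', 'CAN', 'SWE'],
    'weight': [185.0, 227.0, np.nan, 190.0],
    'shots': [185, 0, 195, 38],
})

# Numeric columns: fill missing values with the median, then standardise
numeric_pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('std_scaler', StandardScaler()),
])

# Each entry is (name, transformer, columns); outputs are stacked side by side
toy_pipeline = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['nationality']),
        ('num', numeric_pipe, ['weight', 'shots']),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = toy_pipeline.fit_transform(toy)
print(toy_prepared.shape)
```

The output has one column per nationality category plus one per numeric feature, which is why the number of columns grows once categorical features are encoded.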

In [213]:
# Create instance of regressor
lin_reg = LinearRegression()
lin_reg.fit(X=train_prepared, y=train_labels)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [214]:
# Now that we have a fitted model, let's try to make predictions on
# new data!
some_data = df_test.iloc[:5]
some_labels = test_labels.iloc[:5]

# Transform new data
some_data_prepared = full_pipeline.transform(some_data)

# Print predictions
lin_reg.predict(some_data_prepared)

array([32.51367188, 43.69726562, 50.84960938, 11.85546875, 30.44140625])

In [216]:
test_labels[:5]

0    46
1    31
2    50
3     6
4    39
Name: points, dtype: int64
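
One way to put a number on how close these five predictions are is to compute the mean absolute error and the root mean squared error. The sketch below uses the predicted and actual values copied from the two outputs above; the choice of metrics is an assumption, and five rows are far too few to judge the model.

```python
import numpy as np

# Values copied from the prediction and label outputs above
predictions = np.array([32.51367188, 43.69726562, 50.84960938, 11.85546875, 30.44140625])
actual = np.array([46, 31, 50, 6, 39])

errors = predictions - actual

# Mean absolute error: average size of the miss, in points
mae = np.mean(np.abs(errors))

# Root mean squared error: penalises large misses more heavily
rmse = np.sqrt(np.mean(errors ** 2))

print(f"MAE: {mae:.2f} points, RMSE: {rmse:.2f} points")
```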

In [257]:
pd.DataFrame(train_prepared, columns=['team', 'birthday', 'nationality', 'rookie', 'position_code',
                       'position_type', 'handedness', 'weight', 'pim', 'shots', 'shot_perc', 'games', 'hits', 'blocked', 'plusminus', 'shifts'])

Unnamed: 0,team,birthday,nationality,rookie,position_code,position_type,handedness,weight,pim,shots,shot_perc,games,hits,blocked,plusminus,shifts
0,New Jersey Devils,1985-05-13,CAN,False,C,Forward,R,-1.10947,-0.0903949,0.895125,0.419269,0.84509,-0.282424,-0.221239,2.76384,0.754164
1,New York Islanders,1984-01-19,CAN,False,D,Defenseman,R,1.63763,-1.1978,-1.55366,-1.39736,-2.34732,-1.2219,-1.13015,-0.0955403,-2.02628
2,New York Islanders,1985-12-12,CAN,False,L,Forward,L,-0.651618,-0.128581,1.02749,-0.105049,0.84509,0.690604,-0.640737,2.15731,0.617737
3,New York Islanders,1982-10-30,USA,False,D,Defenseman,L,-0.782433,-0.3577,-1.05067,-0.513412,-0.455521,-0.651503,0.314786,0.164404,-0.583415
4,New York Islanders,1987-11-18,CAN,False,R,Forward,R,0.918153,1.70437,0.246528,-0.0378289,0.68744,4.70015,-0.244544,-0.52878,0.321156
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2928,Vegas Golden Knights,1994-12-20,CAN,False,L,Forward,L,1.04897,-0.472259,-0.865352,-1.0747,-0.92847,0.623499,-0.920402,-0.355484,-1.38122
2929,Vegas Golden Knights,1992-09-01,CZE,False,L,Forward,L,0.198674,-0.663191,-0.335886,-0.118493,0.253903,-0.383082,-0.594126,0.424348,-0.567103
2930,Vegas Golden Knights,1996-05-10,USA,False,R,Forward,R,1.17978,-0.166768,0.709811,0.0764456,0.68744,0.38863,-0.174628,0.164404,-0.0302921
2931,Vegas Golden Knights,1997-02-05,CAN,True,C,Forward,R,-0.128361,-1.1978,-1.54042,-1.39736,-2.34732,-1.23868,-1.15346,-0.182188,-2.03221


In [278]:
CategoricalTransformer().fit_transform(df_train)

Unnamed: 0,season,team,name,birthday,age,nationality,height,weight,number,rookie,...,games,hits,blocked,plusminus,shifts,conference,division,birth_month,birth_season,letter
0,20082009,New Jersey Devils,Travis Zajac,1985-05-13,35,CAN,"6' 2""",185,19,False,...,82,59,40,33,1895,Eastern,Metropolitan,May,Spring,Yes
1,20082009,New York Islanders,Johnny Boychuk,1984-01-19,36,CAN,"6' 2""",227,55,False,...,1,3,1,0,20,Eastern,Metropolitan,Jan,Winter,No
2,20082009,New York Islanders,Andrew Ladd,1985-12-12,34,CAN,"6' 3""",192,16,False,...,82,117,22,26,1803,Eastern,Metropolitan,Dec,Winter,No
3,20082009,New York Islanders,Andy Greene,1982-10-30,37,USA,"5' 11""",190,4,False,...,49,37,63,3,993,Eastern,Metropolitan,Oct,Fall,No
4,20082009,New York Islanders,Cal Clutterbuck,1987-11-18,32,CAN,"5' 11""",216,15,False,...,78,356,39,-5,1603,Eastern,Metropolitan,Nov,Fall,No
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2928,20172018,Vegas Golden Knights,William Carrier,1994-12-20,25,CAN,"6' 2""",218,28,False,...,37,113,10,-3,455,Western,Pacific,Dec,Winter,No
2929,20172018,Vegas Golden Knights,Tomas Nosek,1992-09-01,28,CZE,"6' 2""",205,92,False,...,67,53,24,6,1004,Western,Pacific,Sep,Fall,No
2930,20172018,Vegas Golden Knights,Alex Tuch,1996-05-10,24,USA,"6' 4""",220,89,False,...,78,99,42,3,1366,Western,Pacific,May,Spring,No
2931,20172018,Vegas Golden Knights,Nicolas Roy,1997-02-05,23,CAN,"6' 4""",200,10,True,...,1,2,0,-1,16,Western,Pacific,Feb,Winter,No


In [320]:
categorical_features = ['team', 'birthday', 'nationality', 'rookie', 'position_code',
                       'position_type', 'handedness', 'captain', 'alternate_capt']
numeric_features = ['weight', 'pim', 'shots', 'shot_perc', 'games', 'hits', 'blocked', 'plusminus', 'shifts', 
                    'birthday', 'season', 'height', 'toi' ,'ev_toi' ,'pp_toi', 'sh_toi']


categorical_pipe = Pipeline(steps= [
    ('cat_selector', FeatureSelector(categorical_features)),
    ('cat_transformer', CategoricalTransformer())#,
    #('one_hot_encode', OneHotEncoder(sparse=False))
    ])

numerical_pipe = Pipeline(steps = [
    ('num_selector', FeatureSelector(numeric_features)),
    ('num_transformer', NumericTransformer())#,
    #('imputer', SimpleImputer(strategy='median')),
    #('std_scaler', StandardScaler())   
])

full_pipeline = FeatureUnion(transformer_list= [
    ('cat_pipeline', categorical_pipe),
    ('num_pipeline', numerical_pipe)])

In [321]:
new_prep = full_pipeline.fit_transform(df_train)

In [322]:
pd.DataFrame(new_prep)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,27,28,29,30,31,32,33,34,35,36
0,New Jersey Devils,1985-05-13,CAN,False,C,Forward,R,False,True,Eastern,...,1096:38,268:16,164:13,2008-10-01,23,74,91747,65798,16096,9853
1,New York Islanders,1984-01-19,CAN,False,D,Defenseman,R,False,False,Eastern,...,13:26,00:48,00:34,2008-10-01,25,74,888,806,48,34
2,New York Islanders,1985-12-12,CAN,False,L,Forward,L,False,False,Eastern,...,1071:35,30:08,78:38,2008-10-01,23,75,70821,64295,1808,4718
3,New York Islanders,1982-10-30,USA,False,D,Defenseman,L,False,False,Eastern,...,704:28,84:43,09:04,2008-10-01,26,71,47895,42268,5083,544
4,New York Islanders,1987-11-18,CAN,False,R,Forward,R,False,False,Eastern,...,922:20,31:28,60:36,2008-10-01,21,71,60864,55340,1888,3636
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2928,Vegas Golden Knights,1994-12-20,CAN,False,L,Forward,L,False,False,Western,...,322:38,03:50,00:39,2017-10-01,23,74,19627,19358,230,39
2929,Vegas Golden Knights,1992-09-01,CZE,False,L,Forward,L,False,False,Western,...,634:22,07:18,101:30,2017-10-01,25,74,44590,38062,438,6090
2930,Vegas Golden Knights,1996-05-10,USA,False,R,Forward,R,False,False,Western,...,1004:16,181:56,03:30,2017-10-01,21,76,71382,60256,10916,210
2931,Vegas Golden Knights,1997-02-05,CAN,True,C,Forward,R,False,False,Western,...,10:46,00:00,00:00,2017-10-01,21,76,646,646,0,0
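
The integer column labels in the frame above appear because, with default settings, `FeatureUnion` stacks its branches into a plain array and the dataframe column names are lost. A minimal sketch of one way to recover them is shown below on a toy frame. Everything here is hypothetical: the `toy` data and the `select_team` and `add_sh_toi_secs` helpers stand in for the notebook's pipelines, and the approach assumes each branch returns a dataframe from `transform`.

```python
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Toy frame standing in for df_train (two rows, values for illustration only)
toy = pd.DataFrame({
    'team': ['New Jersey Devils', 'Vegas Golden Knights'],
    'sh_toi': ['164:13', '00:39'],
})


def select_team(X):
    """Hypothetical categorical branch: keep the team column."""
    return X[['team']]


def add_sh_toi_secs(X):
    """Hypothetical numeric branch: add short-handed time on ice in seconds."""
    out = X[['sh_toi']].copy()
    parts = out['sh_toi'].str.split(':', expand=True).astype(int)
    out['sh_toi_secs'] = parts[0] * 60 + parts[1]
    return out


union = FeatureUnion(transformer_list=[
    ('cat', FunctionTransformer(select_team)),
    ('num', FunctionTransformer(add_sh_toi_secs)),
])

# Plain array: the column labels are gone at this point
stacked = union.fit_transform(toy)

# Rebuild the labels by asking each branch for its output columns, in union order
column_names = []
for _, branch in union.transformer_list:
    column_names.extend(branch.transform(toy).columns)

labelled = pd.DataFrame(stacked, columns=column_names)
print(labelled)
```

Running each branch a second time is wasteful on large data, so this is only meant as a debugging aid while the pipeline is being built.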
