# Entity Resolution of Player Metrics with Salary Data for NHL Hockey Players #

Before we can do any real work on our ML project, we need to merge our X and Y data, which were obtained from different sources. Let's do that now.

In [10]:
import pandas as pd
import os

Step 1 is to load our data. Let's get our X's and check the shape. Recall that our X data is NHL performance data, with metrics like Goals, Assists, and Games Played, for each player for each year from 2007-08 until 2023-24.

In [11]:
# Import all X data from 2007-08 to 2023-24
X_file_path = '../../Data/NHL_PlayerData_NaturalStatTrick/parquet'
player_metrics = pd.read_parquet(X_file_path)

# Relabel the Year column as a season string, e.g. 2007 -> '2006-07' (applied to years 2007 through 2024)
player_metrics['Year'] = player_metrics['Year'].apply(lambda x: str(x - 1) + '-' + str(x)[2:])

player_metrics.shape

(16108, 97)
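
The relabelling step can be checked in isolation. The sketch below is a minimal illustration, assuming that `Year` holds the calendar year in which a season ends, which is how the lambda above treats it. The helper name `season_label` and the toy values are hypothetical and not part of the original notebook.

```python
import pandas as pd


def season_label(end_year: int) -> str:
    """Build a 'YYYY-YY' season label from the year the season ends in."""
    # Same rule as the notebook's lambda: previous year, a hyphen, last two digits.
    return f"{end_year - 1}-{str(end_year)[2:]}"


# Toy stand-in for the Year column (hypothetical values)
years = pd.Series([2007, 2008, 2024])
labels = years.map(season_label)
print(labels.tolist())
```

Pulling the rule into a named function makes it easy to spot-check a few seasons before relying on the label as a join key.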

Continuing with step 1, let's now load our Y values and again check the shape.

In [12]:
# Import all Y data from 2007-08 to 2023-24
Y_file_path = '../../Data/Salary/dataframe'

# Y_file_path contains csv files for each year from 2007-08 to 2023-24
# Read every season file, then concatenate once instead of growing a DataFrame in a loop
season_frames = [
    pd.read_csv(f'{Y_file_path}/season={year}.csv') for year in range(2008, 2025)
]
salary_data = pd.concat(season_frames)

salary_data.shape

(22082, 24)

In [13]:
print(f'Player Metrics Columns: \n {player_metrics.columns}')
print()
print(f'Salary Data Columns: \n {salary_data.columns}')

Player Metrics Columns: 
 Index(['Position', 'Team', 'Player', 'TOI', 'GP', 'TOI/GP', 'Goals/60',
       'Total Assists/60', 'First Assists/60', 'Second Assists/60',
       'Total Points/60', 'IPP', 'Shots/60', 'SH%', 'ixG/60', 'iCF/60',
       'iFF/60', 'iSCF/60', 'iHDCF/60', 'Rush Attempts/60',
       'Rebounds Created/60', 'PIM/60', 'Total Penalties/60', 'Minor/60',
       'Major/60', 'Misconduct/60', 'Penalties Drawn/60', 'Giveaways/60',
       'Takeaways/60', 'Hits/60', 'Hits Taken/60', 'Shots Blocked/60',
       'Faceoffs Won/60', 'Faceoffs Lost/60', 'Faceoffs %', 'CF/60', 'CA/60',
       'CF%', 'FF/60', 'FA/60', 'FF%', 'SF/60', 'SA/60', 'SF%', 'GF/60',
       'GA/60', 'GF%', 'xGF/60', 'xGA/60', 'xGF%', 'SCF/60', 'SCA/60', 'SCF%',
       'HDCF/60', 'HDCA/60', 'HDCF%', 'HDGF/60', 'HDGA/60', 'HDGF%', 'MDCF/60',
       'MDCA/60', 'MDCF%', 'MDGF/60', 'MDGA/60', 'MDGF%', 'LDCF/60', 'LDCA/60',
       'LDCF%', 'LDGF/60', 'LDGA/60', 'LDGF%', 'On-Ice SH%', 'On-Ice SV%',
       'PDO', 'Off

As we can see from the shapes, the maximum number of matched rows we can hope to find is 16108, the number of rows in our X's, since we were able to acquire more Y records (22082) than X records.

## Assessing Player Metrics ##

Before we merge with our Y's, we should take a look at the player metrics. One thing I am concerned about is the possibility that individual players show up twice in a single year.

In [18]:
# View players that show up twice in one year in player_metrics
player_metrics[player_metrics.duplicated(subset=['Player', 'Year'], keep=False)].sort_values(by=['Player', 'Position', 'Year'])

| | Position | Team | Player | TOI | GP | TOI/GP | Goals/60 | Total Assists/60 | First Assists/60 | Second Assists/60 | ... | Birth Country | Nationality | Height (in) | Weight (lbs) | Draft Year | Draft Team | Draft Round | Round Pick | Overall Draft Position | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 28 | D | PHI, T.B | Alexandre Picard | 490.03333333333 | 24 | 20.418055555556 | 0.37 | 0.37 | 0.0 | 0.37 | ... | CAN | CAN | 75 | 215 | 2003 | PHI | 3 | 17 | 85.0 | 2006-07 |
| 881 | D | OTT | Alexandre Picard | 886.88333333333 | 47 | 18.869858156028 | 0.41 | 0.54 | 0.14 | 0.41 | ... | CAN | CAN | 75 | 215 | 2003 | PHI | 3 | 17 | 85.0 | 2007-08 |
| 1766 | D | CAR, OTT | Alexandre Picard | 992.73333333333 | 54 | 18.383950617284 | 0.24 | 0.66 | 0.18 | 0.48 | ... | CAN | CAN | 75 | 215 | 2003 | PHI | 3 | 17 | 85.0 | 2008-09 |
| 29 | L | CBJ | Alexandre Picard | 20.3 | 3 | 6.7666666666667 | 0.0 | 0.0 | 0.0 | 0.0 | ... | CAN | CAN | 74 | 206 | 2004 | CBJ | 1 | 8 | 8.0 | 2006-07 |
| 882 | L | CBJ | Alexandre Picard | 103.25 | 15 | 6.8833333333333 | 0.0 | 0.58 | 0.58 | 0.0 | ... | CAN | CAN | 74 | 206 | 2004 | CBJ | 1 | 8 | 8.0 | 2007-08 |
| 1765 | L | CBJ | Alexandre Picard | 64.966666666667 | 9 | 7.2185185185185 | 0.0 | 0.0 | 0.0 | 0.0 | ... | CAN | CAN | 74 | 206 | 2004 | CBJ | 1 | 8 | 8.0 | 2008-09 |
| 12260 | C | CAR | Sebastian Aho | 1090.35 | 56 | 19.470535714286 | 1.32 | 1.82 | 1.32 | 0.5 | ... | FIN | 72 | 176 | 2015 | CAR | 2 | 5 | 35 | | 2019-20 |
| 14204 | C | CAR | Sebastian Aho | 1462.1666666667 | 75 | 19.495555555556 | 1.48 | 1.27 | 0.98 | 0.29 | ... | FIN | 72 | 176 | 2015 | CAR | 2 | 5 | 35 | | 2021-22 |
| 15098 | C | CAR | Sebastian Aho | 1311.55 | 67 | 19.575373134328 | 1.24 | 2.24 | 1.42 | 0.82 | ... | FIN | 72 | 176 | 2015 | CAR | 2 | 5 | 35 | | 2022-23 |
| 15983 | C | CAR | Sebastian Aho | 1311.55 | 67 | 19.575373134328 | 1.24 | 2.24 | 1.42 | 0.82 | ... | FIN | 72 | 176 | 2015 | CAR | 2 | 5 | 35 | | 2023-24 |
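
To make the duplicate check concrete, here is a small sketch on toy data. It assumes only the `Player`, `Position`, and `Year` columns; the rows are a simplified stand-in for the output above, not the real dataset. Counting distinct positions per repeated key is one rough way to tell namesakes (two different people) from true duplicate rows.

```python
import pandas as pd

# Simplified toy rows: one name appearing twice in a season, plus a unique row
toy = pd.DataFrame({
    "Player": ["Alexandre Picard", "Alexandre Picard", "Sebastian Aho"],
    "Position": ["D", "L", "C"],
    "Year": ["2007-08", "2007-08", "2019-20"],
})

# keep=False flags every row whose (Player, Year) key occurs more than once
repeated = toy[toy.duplicated(subset=["Player", "Year"], keep=False)]

# More than one position under the same key hints at two distinct players
positions_per_key = repeated.groupby(["Player", "Year"])["Position"].nunique()
print(positions_per_key)
```

A key with a single position can still hide two different people, so this is only a first filter before bringing in a stronger identifier such as date of birth.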


## Trivial Join ##

player_metrics.columns includes Team, Player, Birth City, Birth Country, Nationality, Date of Birth, Year, Round Pick, Position \
salary_data.columns includes PLAYER, TEAM, AGE, POS, DATE OF BIRTH, season

We absolutely must join on Player and Year, but it would be nice to also join on another column in case two players with exactly the same name are in the league at the same time. Date of Birth is common to both tables, and it is very unlikely that two players in the league at the same time share both a name and a birthdate.
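
The effect of the extra key can be seen on toy data. This is a sketch under stated assumptions: the names, birthdates, and the `GP` and `SALARY` values are invented for illustration. It shows that joining two namesakes on name and season alone pairs every metrics row with every salary row, while adding date of birth keeps each player matched to their own record.

```python
import pandas as pd

# Two hypothetical players who share a name and a season (all values invented)
metrics = pd.DataFrame({
    "Player": ["alex sample", "alex sample"],
    "Year": ["2007-08", "2007-08"],
    "Date of Birth": ["1985-01-01", "1986-02-02"],
    "GP": [47, 15],
})
salary = pd.DataFrame({
    "PLAYER": ["alex sample", "alex sample"],
    "season": ["2007-08", "2007-08"],
    "DATE OF BIRTH": ["1985-01-01", "1986-02-02"],
    "SALARY": [1_000_000, 2_000_000],
})

# Name + season only: each metrics row matches both salary rows
name_year = pd.merge(metrics, salary, how="inner",
                     left_on=["Player", "Year"],
                     right_on=["PLAYER", "season"])

# Adding date of birth removes the spurious cross matches
with_dob = pd.merge(metrics, salary, how="inner",
                    left_on=["Player", "Year", "Date of Birth"],
                    right_on=["PLAYER", "season", "DATE OF BIRTH"])

print(len(name_year), len(with_dob))
```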

In [41]:
# Prepare the key columns for merging player_metrics and salary_data on Player, Year, Birth Date

# Convert the string key columns to lower case so the comparison is case-insensitive
player_metrics['Player'] = player_metrics['Player'].str.lower()
player_metrics['Team'] = player_metrics['Team'].str.lower()
player_metrics['Year'] = player_metrics['Year'].str.lower()
player_metrics['Position'] = player_metrics['Position'].str.lower()

salary_data['PLAYER'] = salary_data['PLAYER'].str.lower()
salary_data['TEAM'] = salary_data['TEAM'].str.lower()
salary_data['season'] = salary_data['season'].str.lower()
salary_data['POS'] = salary_data['POS'].str.lower()
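
Lower-casing handles differences in capitalisation only. As a hedged sketch, the helper below (the name `normalize_names` is hypothetical) also trims surrounding whitespace and strips accents. Whether the two sources actually differ in these ways is an assumption that would need to be checked against the real data.

```python
import pandas as pd


def normalize_names(names: pd.Series) -> pd.Series:
    """Lower-case, trim, and strip accents from a column of names."""
    return (
        names.str.strip()
        .str.lower()
        .str.normalize("NFKD")                 # split letters from their accents
        .str.encode("ascii", errors="ignore")  # drop the accent code points
        .str.decode("ascii")
    )


# Invented sample names, not taken from either dataset
sample = pd.Series(["  Émile Côté ", "ALEX SAMPLE"])
print(normalize_names(sample).tolist())
```

Applying the same function to both name columns keeps the keys comparable, at the cost of merging names that differ only by accents.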

We need to check whether the data types and formats of the two Date of Birth columns align.

In [42]:
print(player_metrics["Date of Birth"])
print(salary_data["DATE OF BIRTH"])

0        1974-08-27
1        1983-04-30
2        1971-08-11
3        1983-09-27
4        1981-07-02
            ...    
16103    1996-11-28
16104    2003-02-24
16105    1996-07-08
16106    2003-05-29
16107    1994-01-05
Name: Date of Birth, Length: 16108, dtype: object
0       Feb. 15, 1972
1         May 2, 1980
2       Apr. 28, 1970
3       Mar. 18, 1977
4       Dec. 23, 1979
            ...      
1555    Feb. 15, 1998
1556     May 21, 1997
1557    Feb. 18, 1998
1558    Jun. 18, 1998
1559    Mar. 30, 2001
Name: DATE OF BIRTH, Length: 22082, dtype: object


The above shows that the formats are clearly different, so we need to convert one to match the other. Let's modify salary_data to use year-month-day formatting.

In [44]:
# Convert salary_data["DATE OF BIRTH"] to year-month-day format

# Preprocess dates to ensure consistency: add a period after each month if it is missing
salary_data["DATE OF BIRTH"] = salary_data["DATE OF BIRTH"].apply(lambda x: x[:3] + '.' + x[3:] if x[3] != '.' else x)

# Convert date to datetime format
salary_data["DATE OF BIRTH"] = pd.to_datetime(salary_data["DATE OF BIRTH"], format='%b. %d, %Y')

# Convert date to year-month-day format
salary_data["DATE OF BIRTH"] = salary_data["DATE OF BIRTH"].dt.strftime('%Y-%m-%d')

salary_data["DATE OF BIRTH"]

0       1972-02-15
1       1980-05-02
2       1970-04-28
3       1977-03-18
4       1979-12-23
           ...    
1555    1998-02-15
1556    1997-05-21
1557    1998-02-18
1558    1998-06-18
1559    2001-03-30
Name: DATE OF BIRTH, Length: 22082, dtype: object

That looks good! After a bit of preprocessing to make the formatting consistent, both dataframes now store birth dates in the same format. We should be about ready to join.
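
An alternative to adding the missing periods is to remove them all, which avoids indexing into each string and tolerates missing values. The sketch below assumes the dates use three-letter English month abbreviations, as in the sample output above. The toy values and the choice of `errors="coerce"` are illustrative, not what the notebook ran.

```python
import pandas as pd

# Toy dates in the two styles seen in the salary data, plus a missing value
raw = pd.Series(["Feb. 15, 1972", "May 2, 1980", "Jun. 18, 1998", None])

# Drop the periods so every value looks like 'Mon D, YYYY'
cleaned = raw.str.replace(".", "", regex=False)

# Unparseable or missing entries become NaT instead of raising
parsed = pd.to_datetime(cleaned, format="%b %d, %Y", errors="coerce")

# Back to year-month-day strings to match player_metrics
iso = parsed.dt.strftime("%Y-%m-%d")
print(iso.tolist())
```

Coercing errors hides bad inputs, so counting the missing values before and after parsing is a sensible companion check.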

## Merging X and Y's ##

We can now try to merge our player_metrics dataframe with our salary_data dataframe on the columns [Player, Year, Date of Birth].

In [45]:
# Merge player_metrics and salary_data on columns Player, Year, Birth Date
merged_data = pd.merge(player_metrics, salary_data, how='inner', left_on=['Player', 'Year', 'Date of Birth'], right_on=['PLAYER', 'season', 'DATE OF BIRTH'])
merged_data.shape

(12357, 121)

In [47]:
# Compute the percentage of records we were able to join on
percentage_joined = merged_data.shape[0] / player_metrics.shape[0]
percentage_joined

0.7671343431835113

Not bad! Looks like we were able to join 12357 of 16108 possible records, which is about 77% of our player performance data.
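
To see which records failed to match, one option is a left merge with `indicator=True`, which adds a `_merge` column describing where each row came from. The sketch below uses invented toy rows (the names, dates, and the `SALARY` column are hypothetical) and is meant as an audit pattern, not a replacement for the inner merge above.

```python
import pandas as pd

# Invented toy rows: three metrics records, only two of which have a salary
metrics = pd.DataFrame({
    "Player": ["alex sample", "blake example", "casey placeholder"],
    "Year": ["2007-08", "2007-08", "2007-08"],
    "Date of Birth": ["1985-01-01", "1986-02-02", "1987-03-03"],
})
salary = pd.DataFrame({
    "PLAYER": ["alex sample", "blake example"],
    "season": ["2007-08", "2007-08"],
    "DATE OF BIRTH": ["1985-01-01", "1986-02-02"],
    "SALARY": [1_000_000, 2_000_000],
})

# A left merge keeps every metrics row; _merge records whether it matched
audit = pd.merge(metrics, salary, how="left",
                 left_on=["Player", "Year", "Date of Birth"],
                 right_on=["PLAYER", "season", "DATE OF BIRTH"],
                 indicator=True)

match_rate = (audit["_merge"] == "both").mean()
unmatched = audit.loc[audit["_merge"] == "left_only", ["Player", "Year"]]
print(match_rate)
print(unmatched)
```

Listing the `left_only` rows shows whether misses come from name spelling, season labels, or birthdates. Note that if the salary table held several rows for one key, the merged row count would exceed the number of matched players, so a row-count ratio is only a rough match rate.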

## Saving this data back to disk ##

Our final step in the entity resolution process is to save the merged data to disk.

In [49]:
# Save the merged data to a csv file
output_dir = '../../Data/entitiesResolved'
# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

output_file = os.path.join(output_dir, 'merged_data.csv')
merged_data.to_csv(output_file, index=False)