## Narrowing down the features
We need to determine which features seem most relevant for predicting whether a player will be waived or traded.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

In [2]:
import os
print(os.path.exists("Data/merged_data/merged_data_collapsed_teams.csv"))

False


In [3]:
print(os.getcwd())

C:\Users\jandr\OneDrive\Documents\GitHub\predicting_nba_transactions\Data\merged_data


In [4]:
player_data = pd.read_csv("merged_data_collapsed_teams.csv")

In [5]:
player_data.sample()

Unnamed: 0,NAME,PLAYER_ID,SEASON_START,TEAMS_LIST,PLAYER_AGE,EXPERIENCE,POS,GP,GS,MIN,...,WAIVED_NEXT_NEXT_OFF,RELEASED_NEXT_NEXT_OFF,TRADED_NEXT_NEXT_OFF,WAIVED_NBA_YEAR,WAIVED_NEXT_NBA_YEAR,RELEASED_NBA_YEAR,RELEASED_NEXT_NBA_YEAR,TRADED_NBA_YEAR,TRADED_NEXT_NBA_YEAR,IN_LEAGUE_NEXT
1657,Glenn Robinson,299,1997,['MIL'],25.0,4,SF,56,56.0,2294.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


We ultimately want to decide which features best predict whether a player will be on a different team by the end of next season, based on his performance and salary in the current season. Since salary data is missing for some rows, let's first drop all rows for which the salary entry is blank.
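Before dropping rows, it can help to check how many are affected. The following is a minimal sketch on a made-up toy frame (the `toy` data and its values are assumptions for illustration, not the real dataset); the same calls should apply to `player_data`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for player_data; the values are made up for illustration.
toy = pd.DataFrame({
    "NAME": ["A", "B", "C", "D"],
    "Salary": [1_000_000.0, np.nan, 2_500_000.0, np.nan],
})

# Count how many rows have no salary entry before discarding them.
n_missing = toy["Salary"].isna().sum()
share_missing = toy["Salary"].isna().mean()

# Keep only rows with a salary; .copy() gives an independent frame to modify later.
toy_with_salary = toy.dropna(subset=["Salary"]).copy()
print(n_missing, share_missing, len(toy_with_salary))
```

If the share of missing salaries is large, it may be worth checking whether the dropped rows cluster in particular seasons before relying on the filtered data.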

In [6]:
player_data = player_data.dropna(subset=['Salary'])

In [7]:
player_data.sample()

Unnamed: 0,NAME,PLAYER_ID,SEASON_START,TEAMS_LIST,PLAYER_AGE,EXPERIENCE,POS,GP,GS,MIN,...,WAIVED_NEXT_NEXT_OFF,RELEASED_NEXT_NEXT_OFF,TRADED_NEXT_NEXT_OFF,WAIVED_NBA_YEAR,WAIVED_NEXT_NBA_YEAR,RELEASED_NBA_YEAR,RELEASED_NEXT_NBA_YEAR,TRADED_NBA_YEAR,TRADED_NEXT_NBA_YEAR,IN_LEAGUE_NEXT
478,Nick Van Exel,89,1994,['LAL'],23.0,2,PG,80,80.0,2944.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


To simplify things for the moment, let's look at a combined feature that indicates if a player is traded, waived, or released by the end of next season.

In [8]:
player_data['MOVED_BY_END_OF_NEXT_SEASON'] = player_data[['WAIVED_NBA_YEAR', 'WAIVED_NEXT_NBA_YEAR', 'RELEASED_NBA_YEAR', 'RELEASED_NEXT_NBA_YEAR', 'TRADED_NBA_YEAR', 'TRADED_NEXT_NBA_YEAR']].any(axis=1).astype(int)
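To make the row-wise logic concrete, here is a small sketch on hypothetical flag values (the column names mirror the notebook, but the data is made up). It shows that `any(axis=1)` marks a row as moved when at least one of its flags is nonzero.

```python
import pandas as pd

# Hypothetical toy flags; column names mirror the notebook, values are made up.
flags = pd.DataFrame({
    "WAIVED_NBA_YEAR": [0.0, 1.0, 0.0],
    "TRADED_NEXT_NBA_YEAR": [0.0, 0.0, 1.0],
    "RELEASED_NBA_YEAR": [0.0, 0.0, 0.0],
})

# any(axis=1) is True for a row when at least one flag in that row is nonzero.
moved = flags.any(axis=1).astype(int)
print(moved.tolist())

# Share of rows labeled as moved, useful as a quick class-balance check.
print(moved.mean())
```

Running the same `.mean()` on the real `MOVED_BY_END_OF_NEXT_SEASON` column would show how balanced the two classes are.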

In [9]:
player_data.sample(6)

Unnamed: 0,NAME,PLAYER_ID,SEASON_START,TEAMS_LIST,PLAYER_AGE,EXPERIENCE,POS,GP,GS,MIN,...,RELEASED_NEXT_NEXT_OFF,TRADED_NEXT_NEXT_OFF,WAIVED_NBA_YEAR,WAIVED_NEXT_NBA_YEAR,RELEASED_NBA_YEAR,RELEASED_NEXT_NBA_YEAR,TRADED_NBA_YEAR,TRADED_NEXT_NBA_YEAR,IN_LEAGUE_NEXT,MOVED_BY_END_OF_NEXT_SEASON
15273,Skylar Mays,1630219,2021,['ATL'],24.0,2,SG,28,5.0,220.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1
9288,Tyrus Thomas,200748,2014,['MEM'],28.0,8,PF,2,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
11589,Chris Singleton,202698,2013,['WAS'],24.0,3,SF,25,0.0,250.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1
8099,John Edwards,2823,2004,['IND'],23.0,1,C,25,1.0,139.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1
3761,Samaki Walker,955,1997,['DAL'],22.0,2,PF,41,19.0,1027.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0
2854,Jerry Stackhouse,711,2011,['ATL'],37.0,17,SF,30,0.0,273.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0


In [10]:
numeric_data = player_data.select_dtypes(include=['number'])

In [11]:
correlations = numeric_data.corr()['MOVED_BY_END_OF_NEXT_SEASON']

In [12]:
print(correlations)

PLAYER_ID                     -0.007206
SEASON_START                  -0.016470
PLAYER_AGE                     0.040112
EXPERIENCE                    -0.012069
GP                            -0.227963
                                 ...   
RELEASED_NEXT_NBA_YEAR         0.131092
TRADED_NBA_YEAR                0.404463
TRADED_NEXT_NBA_YEAR           0.518717
IN_LEAGUE_NEXT                -0.068808
MOVED_BY_END_OF_NEXT_SEASON    1.000000
Name: MOVED_BY_END_OF_NEXT_SEASON, Length: 76, dtype: float64


In [13]:
sorted_correlations = correlations.sort_values(ascending=False)
print(sorted_correlations)

MOVED_BY_END_OF_NEXT_SEASON    1.000000
TRADED_NEXT_NBA_YEAR           0.518717
WAIVED_NBA_YEAR                0.438904
WAIVED_NEXT_NBA_YEAR           0.432247
TRADED_NBA_YEAR                0.404463
                                 ...   
MIN                           -0.226591
GP                            -0.227963
FGM                           -0.229878
DWS                           -0.232815
WS                            -0.236431
Name: MOVED_BY_END_OF_NEXT_SEASON, Length: 76, dtype: float64


In [14]:
pd.set_option('display.max_rows', None)

# Print the sorted correlations
print(sorted_correlations)

# Reset display options if needed
pd.reset_option('display.max_rows')

MOVED_BY_END_OF_NEXT_SEASON    1.000000
TRADED_NEXT_NBA_YEAR           0.518717
WAIVED_NBA_YEAR                0.438904
WAIVED_NEXT_NBA_YEAR           0.432247
TRADED_NBA_YEAR                0.404463
TRADED_NEXT_OFF                0.388480
TRADED_NEXT_REG                0.360414
TRADED_NEXT_NEXT_OFF           0.351442
WAIVED_NEXT_OFF                0.345090
WAIVED_NEXT_NEXT_OFF           0.313604
WAIVED_NEXT_REG                0.300161
WAIVED_REG                     0.268713
RELEASED_NBA_YEAR              0.133510
RELEASED_NEXT_NBA_YEAR         0.131092
RELEASED_NEXT_OFF              0.128133
RELEASED_NEXT_NEXT_OFF         0.119370
TRADED_REG                     0.085308
WAIVED_OFF                     0.075564
TOV_PERCENT                    0.060260
TRADED_POST                    0.058124
TRADED_NEXT_POST               0.057035
RELEASED_NEXT_REG              0.051247
WAIVED_POST                    0.040306
PLAYER_AGE                     0.040112
X3P_AR                         0.038584


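The transaction indicator columns at the top of the list (TRADED_*, WAIVED_*, RELEASED_*) are either components of MOVED_BY_END_OF_NEXT_SEASON or describe the same events, so they are not useful as predictors. Setting those aside, the stats most positively correlated with being moved are:

- TOV_PERCENT: 0.060260
- PLAYER_AGE: 0.040112
- X3P_AR: 0.038584 (I think this is 3-point attempt rate: the percentage of field goal attempts taken from 3-point range)

And the stats most negatively correlated with being moved are:

- WS: -0.236431 (win shares)
- DWS: -0.232815 (defensive win shares)
- FGM: -0.229878
- GP: -0.227963
- MIN: -0.226591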
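Because the transaction indicators dominate the ranking, it may be easier to read the correlations with those columns filtered out. The sketch below assumes every indicator column starts with `WAIVED_`, `RELEASED_`, or `TRADED_` (consistent with the columns printed above, but still an assumption), and `stat_correlations` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

TARGET = "MOVED_BY_END_OF_NEXT_SEASON"
# Assumption: every transaction indicator starts with one of these prefixes.
TRANSACTION_PREFIXES = ("WAIVED_", "RELEASED_", "TRADED_")

def stat_correlations(df: pd.DataFrame, target: str = TARGET) -> pd.Series:
    """Correlate numeric non-transaction columns with the target (hypothetical helper)."""
    numeric = df.select_dtypes(include=["number"])
    keep = [
        col for col in numeric.columns
        if col != target and not col.startswith(TRANSACTION_PREFIXES)
    ]
    # corrwith computes the Pearson correlation of each kept column with the target.
    return numeric[keep].corrwith(numeric[target]).sort_values()

# Toy data with made-up values, only to show the call.
toy = pd.DataFrame({
    "GP": [80, 10, 60, 5],
    "PLAYER_AGE": [25.0, 33.0, 27.0, 35.0],
    "TRADED_NBA_YEAR": [0.0, 1.0, 0.0, 0.0],
    "MOVED_BY_END_OF_NEXT_SEASON": [0, 1, 0, 1],
})
result = stat_correlations(toy)
print(result)
```

Identifier columns such as `PLAYER_ID` would still pass this filter, so they may need to be excluded by name as well.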

However, each of these stats, as well as being waived, is correlated with minutes played. Let's look at the counting stats on a per-minute basis to account for the fact that players who are waived play less than players who are not.


In [15]:
player_data = player_data[player_data['MIN'] != 0].copy()

In [16]:
len(player_data)

11172

In [17]:
columns_to_normalize = ['FGM', 'FGA', 'PTS', 'PF', 'DREB', 'OREB', 'REB', 'FTA', 'FTM', 'STL', 'TOV', 'BLK', 'AST', 'FG3A', 'FG3M']

# Normalize the selected columns by dividing by 'MIN'
player_data[columns_to_normalize] = player_data[columns_to_normalize].div(player_data['MIN'], axis=0)
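The normalization above overwrites the raw totals in place. If the totals are needed later, one alternative is to write the per-minute rates to new columns instead. This is a sketch on made-up toy data, and the `_PER_MIN` suffix is an assumed naming choice rather than anything used in the notebook.

```python
import pandas as pd

# Toy totals with made-up values; one player has zero minutes.
toy = pd.DataFrame({
    "MIN": [2000.0, 100.0, 0.0],
    "PTS": [1000.0, 30.0, 0.0],
    "AST": [400.0, 5.0, 0.0],
})

# Drop zero-minute rows first so the division is well defined.
played = toy[toy["MIN"] != 0].copy()

# Write per-minute rates to new columns (the suffix is an assumed naming choice).
count_cols = ["PTS", "AST"]
per_min = played[count_cols].div(played["MIN"], axis=0).add_suffix("_PER_MIN")
played = played.join(per_min)
print(played)
```

Keeping both versions makes it possible to compare correlations for totals and rates side by side without reloading the CSV.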

In [18]:
numeric_data = player_data.select_dtypes(include=['number'])

In [19]:
correlations = numeric_data.corr()['MOVED_BY_END_OF_NEXT_SEASON']

In [20]:
sorted_correlations = correlations.sort_values(ascending=False)
pd.set_option('display.max_rows', None)

# Print the sorted correlations
print(sorted_correlations)

# Reset display options if needed
pd.reset_option('display.max_rows')

MOVED_BY_END_OF_NEXT_SEASON    1.000000
TRADED_NEXT_NBA_YEAR           0.518883
WAIVED_NBA_YEAR                0.438674
WAIVED_NEXT_NBA_YEAR           0.432198
TRADED_NBA_YEAR                0.404590
TRADED_NEXT_OFF                0.388602
TRADED_NEXT_REG                0.360527
TRADED_NEXT_NEXT_OFF           0.351552
WAIVED_NEXT_OFF                0.345198
WAIVED_NEXT_NEXT_OFF           0.313473
WAIVED_NEXT_REG                0.300254
WAIVED_REG                     0.268282
RELEASED_NBA_YEAR              0.133551
RELEASED_NEXT_NBA_YEAR         0.131132
RELEASED_NEXT_OFF              0.128172
RELEASED_NEXT_NEXT_OFF         0.119406
TRADED_REG                     0.085334
WAIVED_OFF                     0.075623
PF                             0.061292
TOV_PERCENT                    0.060260
TRADED_POST                    0.058142
TRADED_NEXT_POST               0.057052
RELEASED_NEXT_REG              0.051262
WAIVED_POST                    0.040318
PLAYER_AGE                     0.040190


After per-minute normalization, the strongest negative correlations with being moved are still win shares (total, defensive, and offensive), along with VORP (value over replacement player), PER (player efficiency rating), BPM (box plus/minus), points per minute, FGM per minute, TS percent (true shooting percentage), FG_PCT, FTM per minute, FT_PCT, and percent of team salary.