## NBA Salary Analysis 2022-23 Season

Importing necessary libraries and settings:

In [13]:
import pandas as pd
import numpy as np
import os
from IPython.display import HTML, display, Markdown

pd.set_option('display.max_columns', None)

Loading paths and files:

In [14]:
current_directory = os.getcwd()
data_folder = os.path.join(os.path.dirname(current_directory), 'data')
file_path = os.path.join(data_folder, '2023StatsAndSalaries.csv')

work = pd.read_csv(file_path)
df = work.copy()
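
The same path can also be built with `pathlib`. The snippet below is a minimal sketch, not part of the original notebook: `data_file_path` is a hypothetical helper, and it assumes the same layout as above (a `data` folder sitting next to the notebook's working directory).

```python
from pathlib import Path

def data_file_path(working_dir, filename="2023StatsAndSalaries.csv"):
    """Build <parent of working_dir>/data/<filename>, mirroring the os.path logic above."""
    return Path(working_dir).parent / "data" / filename

# Report whether the file is where the notebook expects it before reading it
path = data_file_path(Path.cwd())
print(path, "(exists)" if path.exists() else "(not found)")
```

Checking `path.exists()` before calling `pd.read_csv` gives a clearer message when the notebook is run from an unexpected directory.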

Reviewing the Dataframe:

In [15]:
title = "Data Frame Info:"
display(Markdown(f"### {title}"))
display(df.head())
print("df shape:", df.shape,
      "\nDF column names and data types:\n",
      df.dtypes)

### Data Frame Info:

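|  | Player | Pos | Age | Tm | G | GS | MP | FG | FGA | FG% | 3P | 3PA | 3P% | 2P | 2PA | 2P% | eFG% | FT | FTA | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS | PlayerID | salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Precious Achiuwa | C | 23 | TOR | 55 | 12 | 1140 | 196 | 404 | 0.485 | 29 | 108 | 0.269 | 167 | 296 | 0.564 | 0.521 | 87 | 124 | 0.702 | 100 | 228 | 328 | 50 | 31 | 30 | 59 | 102 | 508 | achiupr01 | 2840160 |
| 1 | Steven Adams | C | 29 | MEM | 42 | 42 | 1133 | 157 | 263 | 0.597 | 0 | 1 | 0.0 | 157 | 262 | 0.599 | 0.597 | 47 | 129 | 0.364 | 214 | 271 | 485 | 97 | 36 | 46 | 79 | 98 | 361 | adamsst01 | 17926829 |
| 2 | Bam Adebayo | C | 25 | MIA | 75 | 75 | 2598 | 602 | 1114 | 0.54 | 1 | 12 | 0.083 | 601 | 1102 | 0.545 | 0.541 | 324 | 402 | 0.806 | 184 | 504 | 688 | 240 | 88 | 61 | 187 | 208 | 1529 | adebaba01 | 30351780 |
| 3 | Ochai Agbaji | SG | 22 | UTA | 59 | 22 | 1209 | 165 | 386 | 0.427 | 81 | 228 | 0.355 | 84 | 158 | 0.532 | 0.532 | 56 | 69 | 0.812 | 43 | 78 | 121 | 67 | 16 | 15 | 41 | 99 | 467 | agbajoc01 | 3918360 |
| 4 | Santi Aldama | PF | 22 | MEM | 77 | 20 | 1682 | 247 | 525 | 0.47 | 94 | 266 | 0.353 | 153 | 259 | 0.591 | 0.56 | 108 | 144 | 0.75 | 85 | 286 | 371 | 97 | 45 | 48 | 60 | 143 | 696 | aldamsa01 | 2094120 |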


df shape: (464, 31) 
DF column names and data types:
 Player       object
Pos          object
Age           int64
Tm           object
G             int64
GS            int64
MP            int64
FG            int64
FGA           int64
FG%         float64
3P            int64
3PA           int64
3P%         float64
2P            int64
2PA           int64
2P%         float64
eFG%        float64
FT            int64
FTA           int64
FT%         float64
ORB           int64
DRB           int64
TRB           int64
AST           int64
STL           int64
BLK           int64
TOV           int64
PF            int64
PTS           int64
PlayerID     object
salary        int64
dtype: object


In [16]:
#setting player name as index for readability
df.set_index('Player', inplace=True)

#### Feature Creation Methods:

In [17]:
def create_columns(df):
    '''Creating columns that may prove important for analysis'''
    #copying df
    df = df.copy()
    #stat creation
    df['PPG'] = (df['PTS'] / df['G'])
    df['APG'] = (df['AST'] / df['G'])
    df['TRPG'] = (df['TRB'] / df['G'])
    df['SPG'] = (df['STL'] / df['G'])
    df['BPG'] = (df['BLK'] / df['G'])
    df['MPG'] = (df['MP'] / df['G'])
    #dollar stat creation
    df['dollarPerMinute'] = (df['salary'] / df['MP']).round(2)
    df['dollarPerFG'] = (df['salary'] / df['FG']).round(2)
    df['dollarPerPoint'] = (df['salary'] / df['PTS']).round(2)
    #some players have 0 turnovers, so we return NaN instead of dividing by 0.
    df['AST/TO'] = np.where(df['TOV'] != 0, df['AST'] / df['TOV'], np.nan)
    
    # Replace infinite values with NaN
    inf_cols = ['AST/TO', 'dollarPerMinute', 'dollarPerFG', 'dollarPerPoint']
    df[inf_cols] = df[inf_cols].replace([np.inf, -np.inf], np.nan)

    return df


Running the feature-creation function:

In [18]:
df = create_columns(df)
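
To illustrate why `create_columns` replaces infinite values, here is a minimal sketch on a made-up two-player frame. The player names and numbers are hypothetical and are not taken from the dataset. A player with zero points gets an infinite `dollarPerPoint`, which is then converted to NaN, and the guarded `AST/TO` ratio is NaN wherever turnovers are zero.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the real dataset
toy = pd.DataFrame(
    {"salary": [1_000_000, 2_000_000], "PTS": [0, 500], "AST": [10, 40], "TOV": [0, 20]},
    index=["Player A", "Player B"],
)

# Dividing a nonzero number by zero in pandas yields inf rather than raising an error
toy["dollarPerPoint"] = (toy["salary"] / toy["PTS"]).round(2)
raw_is_inf = bool(np.isinf(toy.loc["Player A", "dollarPerPoint"]))

# Same clean-up step as in create_columns: inf becomes NaN
toy["dollarPerPoint"] = toy["dollarPerPoint"].replace([np.inf, -np.inf], np.nan)

# Guarded ratio, as in create_columns: NaN wherever the denominator is zero
toy["AST/TO"] = np.where(toy["TOV"] != 0, toy["AST"] / toy["TOV"], np.nan)

print(toy)
```

Converting to NaN matters for later steps because pandas skips NaN in `mean`, `std`, and correlation calculations, whereas a single infinite value would distort them.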

### Correlation:

Creating a correlation matrix to examine which features correlate most strongly with salary. We remove the "dollar per" stats because they are derived from salary itself; they are for exploratory purposes and won't correlate meaningfully.

In [19]:
#CORRELATION
salaryCorr = pd.DataFrame()

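df2 = df.copy()
#dropping the "dollar per" stats, which are derived from salary
df2 = df2.drop(columns=['dollarPerMinute', 'dollarPerFG', 'dollarPerPoint'])
#[1:] skips salary's correlation with itself (1.0), which sorts first
salaryCorr['All'] = df2.corrwith(df2['salary'], numeric_only=True).sort_values(ascending=False)[1:]

positions = ['PG', 'SG', 'SF', 'PF', 'C']
for position in positions:
    #using df2 so the "dollar per" stats are excluded here as well
    position_corr = df2[df2['Pos'] == position].corrwith(df2['salary'], numeric_only=True).sort_values(ascending=False)[1:]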
    salaryCorr[position] = position_corr

In [43]:
display(salaryCorr.head())

|  | All | PG | SG | SF | PF | C |
| --- | --- | --- | --- | --- | --- | --- |
| PPG | 0.722858 | 0.663703 | 0.70776 | 0.723118 | 0.747987 | 0.819297 |
| PTS | 0.652559 | 0.624945 | 0.660861 | 0.541605 | 0.670554 | 0.781178 |
| FG | 0.643683 | 0.617011 | 0.650419 | 0.542621 | 0.658926 | 0.766507 |
| MPG | 0.630499 | 0.608759 | 0.620092 | 0.570127 | 0.63655 | 0.760116 |
| FGA | 0.623244 | 0.597939 | 0.651549 | 0.513856 | 0.628688 | 0.776713 |


Across all positions combined, the five stats most correlated with salary are PPG, total PTS, FG, MPG, and FGA.
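
The per-position correlations can also be computed with `groupby`. The sketch below is an illustration rather than the notebook's own code: `salary_corr_by_position` is a hypothetical helper, and the toy frame is made up. It drops the target column by name instead of relying on the `[1:]` slice, so it does not depend on salary sorting first.

```python
import pandas as pd

def salary_corr_by_position(frame, pos_col="Pos", target="salary"):
    """Return a stat-by-position table of Pearson correlations with `target`."""
    columns = {}
    for position, group in frame.groupby(pos_col):
        numeric = group.select_dtypes("number")
        # Correlate every numeric column with the target within this group only
        columns[position] = numeric.drop(columns=target).corrwith(numeric[target])
    return pd.DataFrame(columns)

# Hypothetical toy data, not from the real dataset
toy = pd.DataFrame({
    "Pos": ["PG", "PG", "PG", "C", "C", "C"],
    "PPG": [10.0, 20.0, 30.0, 5.0, 10.0, 15.0],
    "TOV": [3.0, 2.0, 1.0, 1.0, 2.0, 3.0],
    "salary": [1, 2, 3, 2, 4, 6],
})
result = salary_corr_by_position(toy)
print(result.round(3))
```

In the toy data, turnovers fall as salary rises for point guards but rise with salary for centers, which shows how the sign and strength of a correlation can differ by position.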

### Outlier Calculation:

Performing an outlier calculation to identify and list possible outliers, defined here as salaries more than three standard deviations above the mean.

In [41]:
#OUTLIER
outlier_calc = df['salary'].mean() + 3 * df['salary'].std()
outliers = df.query('salary > @outlier_calc')

display(outliers[['salary']].sort_values(by='salary', ascending=False))

| Player | salary |
| --- | --- |
| Stephen Curry | 48070014 |
| Russell Westbrook | 47559433 |
| John Wall | 47345760 |
| LeBron James | 44474988 |
| Kevin Durant | 44119845 |
| Bradley Beal | 43279250 |
| Giannis Antetokounmpo | 42492492 |
| Paul George | 42492492 |
| Kawhi Leonard | 42492492 |
| Damian Lillard | 42492492 |


There are 10 salary outliers, each earning more than 42 million, with the highest at just over 48 million.
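
The threshold used above, mean plus three standard deviations, can be wrapped in a small reusable function. This is a minimal sketch on made-up numbers: `upper_outliers` and its `k` parameter are hypothetical names, and the toy salaries are not from the dataset.

```python
import pandas as pd

def upper_outliers(values, k=3.0):
    """Return entries more than k sample standard deviations above the mean."""
    threshold = values.mean() + k * values.std()  # Series.std defaults to ddof=1
    return values[values > threshold].sort_values(ascending=False)

# Hypothetical salaries in millions: 19 typical values and one extreme value
toy = pd.Series([5.0] * 19 + [50.0])
flagged = upper_outliers(toy)
print(flagged)
```

Like the calculation above, this rule is one-sided: it only flags unusually high values. For salary data that is usually the side of interest, since salaries are bounded below but have a long upper tail.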

### Summary:

To better analyze the data, we created some basic features and added them to the dataset. We then ran a correlation test to see which features are most strongly associated with salary. Points Per Game (PPG), total Points (PTS), Field Goals Made (FG), Minutes Per Game (MPG), and Field Goals Attempted (FGA) were the 5 features most correlated with salary across all positions. Each position had a different ranked order and strength for these features, although all were still moderately to strongly correlated. Finally, we conducted an outlier calculation and found 10 outliers in the salary column, with each of these players making over 42 million this season.