# Cleaning NBA Player Data

The goal of this Notebook is to clean up the NBA player data that I previously extracted from the Basketball Reference website (www.basketball-reference.com).

In [1]:
# Import the necessary libraries for cleaning the NBA player data
import sys
import pandas as pd
import numpy as np

In [2]:
# load the file from my Google Drive
players_df = pd.read_csv('/content/drive/MyDrive/NBA_players_data.csv')

## Initial Investigation and Cleaning

Let's print the dataset to get an idea of what we're going to have to do to it.

In [3]:
players_df

Unnamed: 0.1,Unnamed: 0,Player,From,To,Pos,Ht,Wt,Birth Date,Colleges,G,PTS,TRB,AST,FG%,FG3%,FT%,eFG%,PER,WS
0,0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,"June 24, 1968",Duke,256,5.7,3.3,0.3,50.2,0.0,70.1,50.2,13.0,4.8
1,1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0,"April 7, 1946",Iowa State,505,9.0,8.0,1.2,42.8,72.8,15.1,15.1,17.5,
2,2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225.0,"April 16, 1947",UCLA,1560,24.6,11.2,3.6,55.9,5.6,72.1,55.9,24.6,273.4
3,3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,"March 9, 1969",LSU,586,14.6,1.9,3.5,44.2,35.4,90.5,47.2,15.4,25.2
4,4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"November 3, 1974","Michigan, San Jose State",236,7.8,3.3,1.1,41.7,23.7,70.3,42.2,11.4,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5018,5018,Ante Žižić,2018,2020,F-C,6-10,266.0,"January 4, 1997",,113,6.0,3.9,0.6,58.1,-,71.1,58.1,17.4,3.5
5019,5019,Jim Zoet,1983,1983,C,7-1,240.0,"December 20, 1953",Kent State University,7,0.3,1.1,0.1,20.0,-,-,20.0,-0.8,-0.1
5020,5020,Bill Zopf,1971,1971,G,6-1,170.0,"June 7, 1948",Duquesne,53,2.2,0.9,1.4,36.3,55.6,9.6,9.6,-0.1,
5021,5021,Ivica Zubac,2017,2022,C,7-0,240.0,"March 18, 1997",,360,8.3,6.5,1.1,59.7,10.0,75.4,59.7,19.2,26.1


As the printout shows, several columns contain missing values, either as blank cells or as a '-' placeholder, so we must address them.

First, there are 2 columns that we can get rid of, as they should have no impact on whether a player will enter the Hall of Fame: 'Unnamed: 0' and 'Birth Date'. The 'Unnamed: 0' column is the index from the csv file that was imported into the DataFrame; we can remove it because the DataFrame provides its own index values. The 'Birth Date' column is useful biographical information, but it will not provide any meaningful insight into whether a player will become a Hall of Famer. In the same cell, we replace the '-' placeholder with NaN so that pandas treats those cells as missing.

In [4]:
players_df.drop(['Unnamed: 0', 'Birth Date'], axis=1, inplace=True)
players_df.replace('-', np.nan, inplace=True)

After removing those 2 columns, we can take a look at how many missing values there are for each column remaining.

In [5]:
missing_data = players_df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")

Player
False    5023
Name: Player, dtype: int64

From
False    5023
Name: From, dtype: int64

To
False    5023
Name: To, dtype: int64

Pos
False    5023
Name: Pos, dtype: int64

Ht
False    5023
Name: Ht, dtype: int64

Wt
False    5018
True        5
Name: Wt, dtype: int64

Colleges
False    4669
True      354
Name: Colleges, dtype: int64

G
False    5023
Name: G, dtype: int64

PTS
False    5023
Name: PTS, dtype: int64

TRB
False    4731
True      292
Name: TRB, dtype: int64

AST
False    5023
Name: AST, dtype: int64

FG%
False    4989
True       34
Name: FG%, dtype: int64

FG3%
False    4463
True      560
Name: FG3%, dtype: int64

FT%
False    4482
True      541
Name: FT%, dtype: int64

eFG%
False    4650
True      373
Name: eFG%, dtype: int64

PER
False    5019
True        4
Name: PER, dtype: int64

WS
False    3895
True     1128
Name: WS, dtype: int64



The result shows us that 9 out of the 17 columns, more than half, have missing values. 
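
The same per-column count can also be produced more compactly. The sketch below is a minimal illustration on a small made-up frame (`toy_df` and its values are placeholders, not rows from the dataset); it assumes only that missing cells are already stored as NaN.

```python
import numpy as np
import pandas as pd

# Small made-up frame; the values are placeholders, not real player data
toy_df = pd.DataFrame({
    'Wt': [240.0, np.nan, 225.0],
    'FG3%': [0.0, np.nan, np.nan],
    'PTS': [5.7, 9.0, 24.6],
})

# isna() marks missing cells as True; summing counts the True values per column
missing_counts = toy_df.isna().sum()

# Keep only the columns that have at least one missing value
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

Applied to `players_df`, the length of the filtered result would give the number of columns with missing values directly.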

## Addition of the 3-Point Line (FG3% and eFG%)

The modern NBA 3-point line wasn't added until the 1979-1980 season. Therefore, any player who finished their career before that season won't have a value for either FG3% or eFG% (eFG% is calculated from made 3-pointers). These players' pages on Basketball Reference won't have columns for these statistics either, so we will have to adjust values in the DataFrame to account for this.

After close examination of the data, it turns out that if a player's career ended before the addition of the 3-point line, some of their statistics were placed in the wrong columns. For these players, FT% landed in the FG3% column, PER landed in both the FT% and eFG% columns, and WS landed in the PER column. We have to move each value to its correct column, while making sure the FG3% and eFG% columns are left as missing values.

In [6]:
pre_3pt = players_df['To'] < 1980 # players whose careers ended before the 3-point line was added
players_df.loc[pre_3pt, 'WS'] = players_df.loc[pre_3pt, 'PER'] # the 'WS' column value is actually the value in the 'PER' column
players_df.loc[pre_3pt, 'PER'] = players_df.loc[pre_3pt, 'eFG%'] # the 'PER' column value is actually the value in the 'eFG%' column
players_df.loc[pre_3pt, 'eFG%'] = np.nan # 'eFG%' doesn't have a value, so we assign it 'NaN'
players_df.loc[pre_3pt, 'FT%'] = players_df.loc[pre_3pt, 'FG3%'] # the 'FT%' column value is actually the value in the 'FG3%' column
players_df.loc[pre_3pt, 'FG3%'] = np.nan # 'FG3%' doesn't have a value, so we assign it 'NaN'

players_df

Unnamed: 0,Player,From,To,Pos,Ht,Wt,Colleges,G,PTS,TRB,AST,FG%,FG3%,FT%,eFG%,PER,WS
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,Duke,256,5.7,3.3,0.3,50.2,0.0,70.1,50.2,13.0,4.8
1,Zaid Abdul-Aziz,1969,1978,C-F,6-9,235.0,Iowa State,505,9.0,8.0,1.2,42.8,,72.8,,15.1,17.5
2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225.0,UCLA,1560,24.6,11.2,3.6,55.9,5.6,72.1,55.9,24.6,273.4
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,LSU,586,14.6,1.9,3.5,44.2,35.4,90.5,47.2,15.4,25.2
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"Michigan, San Jose State",236,7.8,3.3,1.1,41.7,23.7,70.3,42.2,11.4,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5018,Ante Žižić,2018,2020,F-C,6-10,266.0,,113,6.0,3.9,0.6,58.1,,71.1,58.1,17.4,3.5
5019,Jim Zoet,1983,1983,C,7-1,240.0,Kent State University,7,0.3,1.1,0.1,20.0,,,20.0,-0.8,-0.1
5020,Bill Zopf,1971,1971,G,6-1,170.0,Duquesne,53,2.2,0.9,1.4,36.3,,55.6,,9.6,-0.1
5021,Ivica Zubac,2017,2022,C,7-0,240.0,,360,8.3,6.5,1.1,59.7,10.0,75.4,59.7,19.2,26.1


As we can see now, all the right values are in the right place and there are missing values for FG3% and eFG% if the player finished their career before the addition of the 3-point line.
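
As a quick sanity check, the column shift can be reproduced on two rows copied from the raw table above, one career ending in 1995 and one ending in 1978. This is a minimal sketch using a boolean mask; `check_df` and `pre_3pt` are names chosen for illustration only.

```python
import numpy as np
import pandas as pd

# Two rows copied from the raw table: one career ending in 1995, one ending in 1978
check_df = pd.DataFrame({
    'Player': ['Alaa Abdelnaby', 'Zaid Abdul-Aziz'],
    'To': [1995, 1978],
    'FG3%': [0.0, 72.8],
    'FT%': [70.1, 15.1],
    'eFG%': [50.2, 15.1],
    'PER': [13.0, 17.5],
    'WS': [4.8, np.nan],
})

pre_3pt = check_df['To'] < 1980  # careers that ended before the 3-point line

# Copy each value into its correct column, starting from the rightmost
# column so that nothing is overwritten before it has been copied
check_df.loc[pre_3pt, 'WS'] = check_df.loc[pre_3pt, 'PER']
check_df.loc[pre_3pt, 'PER'] = check_df.loc[pre_3pt, 'eFG%']
check_df.loc[pre_3pt, 'FT%'] = check_df.loc[pre_3pt, 'FG3%']
check_df.loc[pre_3pt, ['FG3%', 'eFG%']] = np.nan

print(check_df)
```

The 1995 row should come out unchanged, while the 1978 row should match the corrected row shown in the output above.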

## Changing Data Types

Let's take a look at the data type for each column, to make sure they are what we want them to be.

In [7]:
players_df.dtypes

Player       object
From          int64
To            int64
Pos          object
Ht           object
Wt          float64
Colleges     object
G             int64
PTS         float64
TRB          object
AST         float64
FG%          object
FG3%         object
FT%          object
eFG%         object
PER          object
WS           object
dtype: object

A few columns are of type 'object' but need to be of type 'float'. So, we will convert them to the correct data type.

In [8]:
players_df[['TRB', 'FG%', 'FG3%', 'FT%', 'eFG%', 'PER', 'WS']] = players_df[['TRB', 'FG%', 'FG3%', 'FT%', 'eFG%', 'PER', 'WS']].astype('float')
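
Note that `astype('float')` raises an error if a non-numeric string is still present in a column, which is why the '-' placeholders had to be replaced first. As an alternative, `pd.to_numeric` with `errors='coerce'` converts and blanks out unparseable values in one step. The snippet below is only a sketch on a made-up Series (`raw` is a hypothetical name), not a change to the cell above.

```python
import pandas as pd

# Made-up string column with a leftover '-' placeholder
raw = pd.Series(['50.2', '-', '42.8'])

# errors='coerce' turns anything that cannot be parsed as a number into NaN
as_float = pd.to_numeric(raw, errors='coerce')
print(as_float)
```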

## Combining Positions that are the Same

Now let's see the different positions we have in the dataset.

In [9]:
players_df['Pos'].value_counts()

G      1795
F      1441
C       532
F-C     413
G-F     397
C-F     226
F-G     219
Name: Pos, dtype: int64

A couple of the values are the same position written in a different order. Specifically, 'F-C' and 'C-F' are the same position, as are 'F-G' and 'G-F'. So, we will condense each of these 2 pairs into a single value.

In [None]:
# combine 'C-F' with 'F-C' and 'F-G' with 'G-F'; all other positions are left unchanged
players_df['Pos'] = players_df['Pos'].replace({'C-F': 'F-C', 'F-G': 'G-F'})

In [11]:
players_df['Pos'].value_counts()

G      1795
F      1441
F-C     639
G-F     616
C       532
Name: Pos, dtype: int64

Now, we see that there are 5 distinct position types. Creating these specific position values is important for the next step of the data cleaning, filling in the remaining missing values, as the position a player plays has a big impact on different statistics. 

## Filling in Missing Values

There are still several columns that have missing data. Getting rid of all of these rows would remove a large portion of the dataset, which we don't want. We also don't want to fill each missing value with the mean of the column it is in, as basketball statistics depend heavily on the position of the player. Therefore, we will find the average for each position in the columns that are missing data, and then fill in the value for the position of that player. In other words, if a player plays the 'G' position and is missing a value for 'FG3%', we will fill that missing value with the mean of 'FG3%' for the position 'G'.
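
To make the difference between the two fill strategies concrete, here is a minimal sketch on made-up numbers (`toy_df` and its values are invented so the arithmetic is easy to follow, they are not real player data). It uses `groupby(...).transform('mean')`, which returns the position mean aligned to every row.

```python
import numpy as np
import pandas as pd

# Made-up values chosen so the arithmetic is easy to follow
toy_df = pd.DataFrame({
    'Pos': ['G', 'G', 'G', 'C', 'C'],
    'FG3%': [30.0, 40.0, np.nan, 10.0, 20.0],
})

# Mean over every position combined (NaN is skipped)
overall_mean = toy_df['FG3%'].mean()

# Mean of each row's own position, aligned to the original rows
pos_mean = toy_df.groupby('Pos')['FG3%'].transform('mean')

# The missing 'G' value takes the 'G' mean, not the overall mean
filled = toy_df['FG3%'].fillna(pos_mean)
print(overall_mean)
print(filled)
```

In this toy frame the guard with the missing value is filled with the guards' mean of 35.0 rather than the overall mean of 25.0, which the two centers pull down.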

In [12]:
tfg_avg = players_df.groupby('Pos', as_index=False)['FG3%'].mean() # calculate the mean of 'FG3%' for each position
eff_avg = players_df.groupby('Pos', as_index=False)['eFG%'].mean() # calculate the mean of 'eFG%' for each position
fg_avg = players_df.groupby('Pos', as_index=False)['FG%'].mean() # calculate the mean of 'FG%' for each position
ft_avg = players_df.groupby('Pos', as_index=False)['FT%'].mean() # calculate the mean of 'FT%' for each position
wt_avg = players_df.groupby('Pos', as_index=False)['Wt'].mean() # calculate the mean of 'Wt' for each position
per_avg = players_df.groupby('Pos', as_index=False)['PER'].mean() # calculate the mean of 'PER' for each position
reb_avg = players_df.groupby('Pos', as_index=False)['TRB'].mean() # calculate the mean of 'TRB' for each position
ws_avg = players_df.groupby('Pos', as_index=False)['WS'].mean() # calculate the mean of 'WS' for each position
print(tfg_avg)
print('--------------')
print(eff_avg)
print('--------------')
print(fg_avg)
print('--------------')
print(ft_avg)
print('--------------')
print(wt_avg)
print('--------------')
print(per_avg)
print('--------------')
print(reb_avg)
print('--------------')
print(ws_avg)

   Pos       FG3%
0    C  15.028571
1    F  24.173389
2  F-C  17.183380
3    G  28.504624
4  G-F  29.496703
--------------
   Pos       eFG%
0    C  46.158718
1    F  45.946045
2  F-C  49.024051
3    G  44.581890
4  G-F  47.673797
--------------
   Pos        FG%
0    C  43.746958
1    F  40.973739
2  F-C  44.524765
3    G  38.627762
4  G-F  40.867427
--------------
   Pos        FT%
0    C  59.596250
1    F  63.561556
2  F-C  64.563462
3    G  69.628921
4  G-F  69.048344
--------------
   Pos          Wt
0    C  245.073585
1    F  218.525694
2  F-C  226.541471
3    G  188.160625
4  G-F  200.689935
--------------
   Pos        PER
0    C  14.402590
1    F  14.218135
2  F-C  16.597104
3    G  13.285765
4  G-F  15.294041
--------------
   Pos       TRB
0    C  4.081102
1    F  3.047080
2  F-C  5.489580
3    G  1.812368
4  G-F  3.133333
--------------
   Pos         WS
0    C  14.987970
1    F   9.986875
2  F-C  23.393427
3    G  11.127703
4  G-F  20.464610


As we can see from the outputs, the 5 positions have noticeably different means for most of the statistics. For example, the mean 'FT%' for 'G' is 69.6%, but the mean for 'C' is 59.6%, 10 percentage points lower.
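
Because 'FT%' is itself a percentage, the gap can be expressed either in absolute terms (percentage points) or relative to the guards' mean. The short calculation below works through both, using the two means printed above; the variable names are illustrative.

```python
# Mean 'FT%' by position, taken from the output above
ft_guard = 69.628921
ft_center = 59.596250

# Absolute gap, in percentage points
gap_points = ft_guard - ft_center

# Relative gap, as a percentage of the guards' mean
gap_relative = gap_points / ft_guard * 100

print(f"{gap_points:.1f} percentage points, {gap_relative:.1f}% relative")
```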

Now that we have the means for each value in 'Pos', we will go through our dataset and replace each missing value with the mean for the player's position, rounded to one decimal place.

In [None]:
# Fill each missing value with the mean for the player's position, rounded to 1 decimal place
stat_cols = ['FG3%', 'eFG%', 'FG%', 'FT%', 'Wt', 'PER', 'TRB', 'WS']
for col in stat_cols:
  pos_mean = players_df.groupby('Pos')[col].transform('mean').round(1) # position mean, aligned to each row
  players_df[col] = players_df[col].fillna(pos_mean) # only the missing values are replaced

The last column with missing values is 'Colleges'. A missing value in this column means that the player either went to the NBA straight from high school or came from overseas and played in a league outside of the US. So we will replace missing values in this column with 'High School/Overseas'.

In [14]:
players_df['Colleges'] = players_df['Colleges'].fillna('High School/Overseas')

Now let's look at the updated dataset.

In [15]:
players_df

Unnamed: 0,Player,From,To,Pos,Ht,Wt,Colleges,G,PTS,TRB,AST,FG%,FG3%,FT%,eFG%,PER,WS
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,Duke,256,5.7,3.3,0.3,50.2,0.0,70.1,50.2,13.0,4.8
1,Zaid Abdul-Aziz,1969,1978,F-C,6-9,235.0,Iowa State,505,9.0,8.0,1.2,42.8,17.2,72.8,49.0,15.1,17.5
2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225.0,UCLA,1560,24.6,11.2,3.6,55.9,5.6,72.1,55.9,24.6,273.4
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,LSU,586,14.6,1.9,3.5,44.2,35.4,90.5,47.2,15.4,25.2
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"Michigan, San Jose State",236,7.8,3.3,1.1,41.7,23.7,70.3,42.2,11.4,3.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5018,Ante Žižić,2018,2020,F-C,6-10,266.0,High School/Overseas,113,6.0,3.9,0.6,58.1,17.2,71.1,58.1,17.4,3.5
5019,Jim Zoet,1983,1983,C,7-1,240.0,Kent State University,7,0.3,1.1,0.1,20.0,15.0,59.6,20.0,-0.8,-0.1
5020,Bill Zopf,1971,1971,G,6-1,170.0,Duquesne,53,2.2,0.9,1.4,36.3,28.5,55.6,44.6,9.6,-0.1
5021,Ivica Zubac,2017,2022,C,7-0,240.0,High School/Overseas,360,8.3,6.5,1.1,59.7,10.0,75.4,59.7,19.2,26.1


## Final Changes

We want to know how many years each player was in the NBA, as this is more valuable than knowing the values for 'From' and 'To'.

In [16]:
# Add a new column for the number of years each player was in the NBA
players_df['Years'] = players_df['To'] - players_df['From'] + 1
players_df.head()

Unnamed: 0,Player,From,To,Pos,Ht,Wt,Colleges,G,PTS,TRB,AST,FG%,FG3%,FT%,eFG%,PER,WS,Years
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,Duke,256,5.7,3.3,0.3,50.2,0.0,70.1,50.2,13.0,4.8,5
1,Zaid Abdul-Aziz,1969,1978,F-C,6-9,235.0,Iowa State,505,9.0,8.0,1.2,42.8,17.2,72.8,49.0,15.1,17.5,10
2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225.0,UCLA,1560,24.6,11.2,3.6,55.9,5.6,72.1,55.9,24.6,273.4,20
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,LSU,586,14.6,1.9,3.5,44.2,35.4,90.5,47.2,15.4,25.2,11
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"Michigan, San Jose State",236,7.8,3.3,1.1,41.7,23.7,70.3,42.2,11.4,3.5,6


It is worth noting that some players took years off in between when they started and finished their career, so the value for years may not be entirely accurate for every single player.

Now that we've cleaned the data and updated any missing values, let's look at some statistics for the dataset.

In [17]:
players_df.describe()

Unnamed: 0,From,To,Wt,G,PTS,TRB,AST,FG%,FG3%,FT%,eFG%,PER,WS,Years
count,5023.0,5023.0,5023.0,5023.0,5023.0,5023.0,5023.0,5023.0,5023.0,5023.0,5023.0,5023.0,5023.0,5023.0
mean,1988.400956,1992.587896,209.318754,270.059327,6.411328,3.035537,1.416564,40.867768,24.518336,66.110472,46.083556,14.33914,13.914692,5.18694
std,22.471835,23.180842,26.015055,310.280656,4.76454,2.244231,1.358773,10.720956,12.31834,19.150188,8.805673,10.918856,25.682231,4.509547
min,1947.0,1947.0,114.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-48.6,-52.7,1.0
25%,1971.0,1974.5,190.0,32.0,2.9,1.5,0.5,36.7,17.2,60.7,44.6,9.7,0.1,1.0
50%,1990.0,1996.0,210.0,128.0,5.2,2.5,1.0,42.3,26.8,69.9,46.2,12.6,3.0,3.0
75%,2008.0,2014.0,225.0,447.0,8.7,4.0,1.9,46.4,31.55,77.7,49.2,15.3,15.7,8.0
max,2022.0,2022.0,360.0,1611.0,30.1,22.9,11.2,100.0,100.0,100.0,150.0,100.0,273.4,23.0


From these statistics we can see some interesting things. The average NBA player averages 6.4 points, 3.0 rebounds, and 1.4 assists per game, and plays in an average of 270 games over 5.2 years in the NBA.

## Splitting Current and Former Players

The last thing we want to do is split the dataset into 2 different datasets, one for current NBA players and one for former NBA players. The dataset of former players will be used to train and test a model to predict whether a player will be a Hall of Famer or not. Once the model is trained, we will use the current players dataset to make predictions.

In [18]:
last_season = players_df['To'].max() # the most recent season in the dataset
current_df = players_df[players_df['To'] == last_season].copy() # current players
former_df = players_df[players_df['To'] != last_season].copy() # former players

We then add a column to the former players dataset, 'HOF', which tells us whether each individual player is in the Hall of Fame. If a player has an asterisk (*) next to their name, they are in the Hall of Fame. In the new 'HOF' column, a value of 0 denotes that they are not in the Hall of Fame, and a value of 1 denotes that they are in the Hall of Fame.

In [None]:
# 1 if the player's name contains an asterisk (Hall of Famer), otherwise 0
former_df['HOF'] = former_df['Player'].str.contains('*', regex=False).astype(int)

In [20]:
former_df['HOF'].value_counts()

0    4254
1     164
Name: HOF, dtype: int64

We can see that, of the former NBA players, only 164 out of 4,418 are in the Hall of Fame, about 3.7%. However, in order to be considered for the NBA Hall of Fame, a player must be retired for more than 4 years, so there are arguably a handful of former players who retired recently and will soon become Hall of Famers. Still, since there are 605 current players, we should expect about 22 or 23 of them to become Hall of Famers if the rate among former players holds.
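
The arithmetic behind that estimate is shown below, using only the counts reported above. It is a back-of-the-envelope sketch that assumes current players will reach the Hall of Fame at the same rate as former players; the variable names are illustrative.

```python
hof_count = 164            # former players flagged as Hall of Famers
former_count = 4254 + 164  # all former players
current_count = 605        # current players

# Share of former players who are in the Hall of Fame
hof_rate = hof_count / former_count

# Expected number of current players to reach the Hall of Fame at that rate
expected_hof = hof_rate * current_count

print(f"rate: {hof_rate:.1%}, expected current Hall of Famers: {expected_hof:.1f}")
```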

The final thing to do is to reset the index for the current and former player datasets.

In [21]:
current_df = current_df.reset_index(drop=True) # assign the result so the new index is kept
current_df

Unnamed: 0,Player,From,To,Pos,Ht,Wt,Colleges,G,PTS,TRB,AST,FG%,FG3%,FT%,eFG%,PER,WS,Years
0,Precious Achiuwa,2021,2022,F,6-8,225.0,Memphis,134,7.2,5.1,0.8,46.8,35.7,55.6,50.2,13.1,3.8,2
1,Steven Adams,2014,2022,C,6-11,265.0,Pitt,664,9.3,8.0,1.5,58.7,7.1,54.7,58.7,17.0,56.0,9
2,Bam Adebayo,2018,2022,F-C,6-9,255.0,Kentucky,343,13.5,8.3,3.5,55.8,14.0,74.1,55.9,20.0,35.4,5
3,Santi Aldama,2022,2022,F,6-11,224.0,Loyola (MD),32,4.1,2.7,0.7,40.2,12.5,62.5,42.4,10.2,0.3,1
4,LaMarcus Aldridge,2007,2022,F-C,6-11,250.0,Texas,1076,19.1,8.1,1.9,49.3,32.0,81.3,49.9,20.7,115.7,16
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
600,Thaddeus Young,2008,2022,F,6-8,235.0,Georgia Tech,1085,12.8,5.8,1.8,50.2,33.2,66.3,52.3,16.4,68.9,15
601,Trae Young,2019,2022,G,6-1,164.0,Oklahoma,280,25.3,3.9,9.1,44.0,35.5,87.3,51.0,22.3,26.3,4
602,Omer Yurtseven,2022,2022,C,7-0,264.0,"NC State, Georgetown",56,5.3,5.3,0.9,52.6,9.1,62.3,52.8,17.4,2.1,1
603,Cody Zeller,2014,2022,F-C,6-11,240.0,Indiana,494,8.5,6.0,1.4,52.0,22.1,73.1,52.5,16.2,32.1,9


In [22]:
former_df = former_df.reset_index(drop=True) # assign the result so the new index is kept
former_df

Unnamed: 0,Player,From,To,Pos,Ht,Wt,Colleges,G,PTS,TRB,AST,FG%,FG3%,FT%,eFG%,PER,WS,Years,HOF
0,Alaa Abdelnaby,1991,1995,F-C,6-10,240.0,Duke,256,5.7,3.3,0.3,50.2,0.0,70.1,50.2,13.0,4.8,5,0
1,Zaid Abdul-Aziz,1969,1978,F-C,6-9,235.0,Iowa State,505,9.0,8.0,1.2,42.8,17.2,72.8,49.0,15.1,17.5,10,0
2,Kareem Abdul-Jabbar*,1970,1989,C,7-2,225.0,UCLA,1560,24.6,11.2,3.6,55.9,5.6,72.1,55.9,24.6,273.4,20,1
3,Mahmoud Abdul-Rauf,1991,2001,G,6-1,162.0,LSU,586,14.6,1.9,3.5,44.2,35.4,90.5,47.2,15.4,25.2,11,0
4,Tariq Abdul-Wahad,1998,2003,F,6-6,223.0,"Michigan, San Jose State",236,7.8,3.3,1.1,41.7,23.7,70.3,42.2,11.4,3.5,6,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4413,Paul Zipser,2017,2018,G-F,6-8,215.0,High School/Overseas,98,4.7,2.6,0.8,37.1,33.5,76.9,44.8,6.1,0.0,2,0
4414,Ante Žižić,2018,2020,F-C,6-10,266.0,High School/Overseas,113,6.0,3.9,0.6,58.1,17.2,71.1,58.1,17.4,3.5,3,0
4415,Jim Zoet,1983,1983,C,7-1,240.0,Kent State University,7,0.3,1.1,0.1,20.0,15.0,59.6,20.0,-0.8,-0.1,1,0
4416,Bill Zopf,1971,1971,G,6-1,170.0,Duquesne,53,2.2,0.9,1.4,36.3,28.5,55.6,44.6,9.6,-0.1,1,0


Finally, let's save all 3 of the datasets to csv files to use for data exploration and analysis.

In [23]:
players_df.to_csv('/content/drive/MyDrive/NBA_players_clean.csv')
former_df.to_csv('/content/drive/MyDrive/former_players.csv')
current_df.to_csv('/content/drive/MyDrive/current_players.csv')
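
One thing to keep in mind when loading these files later: `to_csv` writes the index by default, so reading a file back produces an extra 'Unnamed: 0' column like the one dropped at the start of this notebook. If that column is not wanted downstream, passing `index=False` is one option. The sketch below demonstrates the difference on a made-up two-row frame written to an in-memory buffer rather than to Google Drive (`toy_df` and the buffer names are illustrative).

```python
import io
import pandas as pd

# Made-up two-row frame standing in for one of the cleaned datasets
toy_df = pd.DataFrame({'Player': ['A', 'B'], 'PTS': [5.7, 9.0]})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```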