# Preprocessing the NBA dataset

## Import the data

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Player statistics per season
X = pd.read_csv("Seasons_Stats.csv", index_col='Unnamed: 0')
# Player salaries for the 2017-18 season
y = pd.read_csv("datasets_17435_36417_NBA_season1718_salary.csv", index_col='Unnamed: 0')

In [3]:
X.head()

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,...,0.705,,,,176.0,,,,217.0,458.0
1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,0.435,...,0.708,,,,109.0,,,,99.0,279.0
2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,...,0.698,,,,140.0,,,,192.0,438.0
3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,0.312,...,0.559,,,,20.0,,,,29.0,63.0
4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,0.308,...,0.548,,,,20.0,,,,27.0,59.0


## Editing the dataframe

We start by editing some of the stat columns. Columns such as points (PTS) and assists (AST) hold totals over the whole season, but for our purposes per-game values are more useful. So we select the columns we want and divide each of them by the number of games played (G).

In [4]:
col_G = ['FG', 'MP', 'FGA', '3P', '3PA', '2P', '2PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']

In [5]:
X[col_G] = X[col_G].div(X['G'], axis=0)

In [6]:
X.head()

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,...,0.705,,,,2.793651,,,,3.444444,7.269841
1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,0.435,...,0.708,,,,2.22449,,,,2.020408,5.693878
2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,...,0.698,,,,2.089552,,,,2.865672,6.537313
3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,0.312,...,0.559,,,,1.333333,,,,1.933333,4.2
4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,0.308,...,0.548,,,,1.538462,,,,2.076923,4.538462

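The per-game conversion above can be sketched on a toy frame. This is a minimal illustration, not the notebook's data: the player names and numbers are made up, and only two stat columns are used. It shows that `div` with `axis=0` divides each row by that row's games played, and that missing totals stay missing.

```python
import pandas as pd

# Toy season totals; all values are invented for illustration.
toy = pd.DataFrame({
    "Player": ["A", "B"],
    "G": [10.0, 4.0],
    "PTS": [200.0, 50.0],
    "AST": [30.0, None],  # a missing total, as in the 1950 rows
})

stat_cols = ["PTS", "AST"]

# Divide every selected column by games played, row by row.
per_game = toy.copy()
per_game[stat_cols] = toy[stat_cols].div(toy["G"], axis=0)

print(per_game)
```

Working on a copy keeps the season totals available in `toy`, whereas the notebook overwrites the columns in place.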

Select just the 2016-2017 season:

In [7]:
X1 = X.loc[X['Year'] == 2017]

Next we need to handle in-season player transfers. A fair number of players played for multiple teams and thus appear multiple times. Luckily, the dataset includes a row with the team code 'TOT' (total) for each of these players, which combines the contributions from all of their teams. So we keep only that row and drop the per-team rows.

In [8]:
list_drop = X1.loc[X1['Tm'] == 'TOT', 'Player'].tolist()

In [9]:
index_drop = X1[X1['Player'].isin(list_drop) & (X1['Tm'] != 'TOT')].index.tolist()

In [10]:
X2 = X1.drop(index_drop)

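The handling of traded players can be checked on a small example. The sketch below uses an invented frame with one traded player and one who stayed put. It matches players by name, as the cells above do, so it assumes names are unique within a season.

```python
import pandas as pd

# Invented rows: player "A" was traded and has a 'TOT' row
# plus one row per team; player "B" has a single row.
toy = pd.DataFrame({
    "Player": ["A", "A", "A", "B"],
    "Tm": ["TOT", "BOS", "LAL", "NYK"],
    "G": [60, 25, 35, 70],
})

# Players who have a 'TOT' row.
traded = toy.loc[toy["Tm"] == "TOT", "Player"]

# Per-team rows of those players are the ones to drop.
is_partial = toy["Player"].isin(traded) & (toy["Tm"] != "TOT")
deduped = toy[~is_partial]

print(deduped)
```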
## Export our file

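We now combine the salary and stat dataframes into one and then save the result to a CSV file. Before merging, we drop the team column from the salary data and sum the remaining values per player, so that each player appears only once.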

In [13]:
y1 = y.drop(['Tm'], axis=1).groupby('Player', as_index=False).sum()

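To illustrate what the `groupby(...).sum()` step does, here is a small sketch on invented salary rows. The column name `Salary` is a placeholder and is not taken from the actual salary file.

```python
import pandas as pd

# Invented salary rows; "Salary" is a hypothetical column name.
toy_salary = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "Tm": ["BOS", "LAL", "NYK"],
    "Salary": [1_000_000, 500_000, 2_000_000],
})

# Drop the team column, then add up the salary rows per player.
# as_index=False keeps 'Player' as a regular column for merging.
totals = (
    toy_salary.drop(columns=["Tm"])
    .groupby("Player", as_index=False)
    .sum()
)

print(totals)
```

Dropping `Tm` first matters: summing a string column would concatenate the team codes rather than discard them.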

In [14]:
Com_data = pd.merge(X2, y1, on='Player')

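`pd.merge` performs an inner join by default, so players present in only one of the two tables are dropped without warning. The sketch below, on invented data with a placeholder `Salary` column, shows one possible way to list the players that would be lost, using the `how` and `indicator` parameters of `pd.merge`.

```python
import pandas as pd

# Invented tables; "C" has stats only and "D" has a salary only.
stats = pd.DataFrame({"Player": ["A", "B", "C"], "PTS": [20.0, 12.5, 8.0]})
salary = pd.DataFrame(
    {"Player": ["A", "B", "D"], "Salary": [1_500_000, 2_000_000, 900_000]}
)

# Default inner join: only players found in both tables survive.
inner = pd.merge(stats, salary, on="Player")

# Outer join with an indicator column to audit the unmatched rows.
audit = pd.merge(stats, salary, on="Player", how="outer", indicator=True)
unmatched = audit.loc[audit["_merge"] != "both", "Player"].tolist()

print(inner)
print(unmatched)
```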
In [150]:
Com_data.to_csv('Complete1617.csv')