# NBA Data Analysis

The data used in this notebook was downloaded from [Kaggle](https://www.kaggle.com/drgilermo/nba-players-stats#Seasons_Stats.csv).  The original source of the data is [Basketball-reference](http://www.basketball-reference.com/).


## General Intro EDA

In [1]:
%load_ext nb_black


In [2]:
import pandas as pd
import numpy as np

from scipy import stats
from sklearn.preprocessing import StandardScaler, MinMaxScaler

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

data_url = "https://docs.google.com/spreadsheets/d/1m0jaYL1KGjxW1cKJUQxVTcPOnm7v7NZEBKRZADCmc68/export?format=csv"
nba = pd.read_csv(data_url)
nba.head()

Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,TS%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1950.0,Curly Armstrong,G-F,31.0,FTW,63.0,,,,0.368,...,0.705,,,,176.0,,,,217.0,458.0
1,1950.0,Cliff Barker,SG,29.0,INO,49.0,,,,0.435,...,0.708,,,,109.0,,,,99.0,279.0
2,1950.0,Leo Barnhorst,SF,25.0,CHS,67.0,,,,0.394,...,0.698,,,,140.0,,,,192.0,438.0
3,1950.0,Ed Bartels,F,24.0,TOT,15.0,,,,0.312,...,0.559,,,,20.0,,,,29.0,63.0
4,1950.0,Ed Bartels,F,24.0,DNN,13.0,,,,0.308,...,0.548,,,,20.0,,,,27.0,59.0



In [3]:
nba = nba.drop(columns=["blanl", "blank2"])
nba = nba.dropna(subset=["Year", "Age", "Pos", "Tm"])


In [4]:
crucial_cols = ["Year", "Player", "Pos", "Tm"]

rows_og, cols_og = nba.shape
rows, cols = nba.dropna(subset=crucial_cols).shape

rows_og - rows


0


The cells above repeat the preprocessing from before.

## Feature Engineering

We have a lot of useful data here, but most of the predictive models we'll be looking at only accept numeric data.  To still use the non-numeric information, we have to do a little reformatting.

### One Hot Encoding / Dummy Encoding

For example, we might "one-hot encode" the team column (also known as creating dummy variables).  This replaces a categorical column with a series of True/False columns, one per category.

Create a dataframe that is a subset of the `nba` dataframe.  Only include in this subset:

* Columns: `Year`, `PTS`, `Age`, & `Tm`
* Rows: a random selection of 15 rows (use 42 as the `random_state`)

In [16]:
# subset columns
nba_sub = nba[["Year", "PTS", "Age", "Tm"]]

# subset rows
nba_sub = nba_sub.sample(n=15, random_state=42)


nba_sub

Unnamed: 0,Year,PTS,Age,Tm
2926,1970.0,660.0,24.0,ATL
6455,1982.0,2.0,23.0,POR
12753,1996.0,1384.0,21.0,PHI
4613,1976.0,1465.0,27.0,HOU
20691,2011.0,150.0,24.0,BOS
3063,1970.0,1181.0,25.0,SFW
20271,2010.0,667.0,32.0,DEN
13367,1997.0,938.0,22.0,POR
10862,1992.0,1195.0,27.0,SAC
13363,1997.0,1435.0,20.0,BOS



Use `pd.get_dummies()` on the subset.

* What happened?
* What might we change about this and why?
* What does the `drop_first` argument of `pd.get_dummies()` do and why?

In [6]:
pd.get_dummies(nba_sub, columns=["Tm"], drop_first=True)

Unnamed: 0,Year,PTS,Player,Tm_BOS,Tm_DEN,Tm_HOU,Tm_MLH,Tm_PHI,Tm_POR,Tm_SAC,Tm_SEA,Tm_SFW,Tm_TOT
2926,1970.0,660.0,Gary Gregor,0,0,0,0,0,0,0,0,0,0
6455,1982.0,2.0,Carl Bailey,0,0,0,0,0,1,0,0,0,0
12753,1996.0,1384.0,Jerry Stackhouse,0,0,0,0,1,0,0,0,0,0
4613,1976.0,1465.0,Rudy Tomjanovich,0,0,1,0,0,0,0,0,0,0
20691,2011.0,150.0,Semih Erden,1,0,0,0,0,0,0,0,0,0
3063,1970.0,1181.0,Ron Williams,0,0,0,0,0,0,0,0,1,0
20271,2010.0,667.0,Kenyon Martin,0,1,0,0,0,0,0,0,0,0
13367,1997.0,938.0,Rasheed Wallace,0,0,0,0,0,1,0,0,0,0
10862,1992.0,1195.0,Wayman Tisdale,0,0,0,0,0,0,1,0,0,0
13363,1997.0,1435.0,Antoine Walker,1,0,0,0,0,0,0,0,0,0
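
To make the effect of `drop_first` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` frame and its values are illustrative, not taken from the NBA data). With `drop_first=True`, the first category in sorted order loses its column and becomes the baseline, which is why the `ATL` row in the output above is all zeros.

```python
import pandas as pd

# Hypothetical toy data, not from the NBA dataset
toy = pd.DataFrame({"PTS": [10, 20, 30], "Tm": ["ATL", "BOS", "DEN"]})

full = pd.get_dummies(toy, columns=["Tm"])
dropped = pd.get_dummies(toy, columns=["Tm"], drop_first=True)

print(list(full.columns))     # ['PTS', 'Tm_ATL', 'Tm_BOS', 'Tm_DEN']
print(list(dropped.columns))  # ['PTS', 'Tm_BOS', 'Tm_DEN']

# Without drop_first, the dummy columns sum to 1 in every row,
# so any one of them can be inferred from the others.
row_sums = full[["Tm_ATL", "Tm_BOS", "Tm_DEN"]].sum(axis=1).tolist()
print(row_sums)  # [1, 1, 1]
```

That redundancy is harmless for many models, but it makes the columns perfectly collinear, which can be a problem for linear models that include an intercept.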



There are some issues with using `pd.get_dummies` in a machine learning workflow.  The main one is that it doesn't remember which categories it has seen, so applying it to new data with different categories produces a different set of columns.  For today, we'll stick with it because it is easier to use than the more powerful options.

Using [`sklearn.preprocessing.OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) overcomes the issues that `pd.get_dummies` can run into, but its usage is a little more complex.
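
As a rough illustration of the difference, the sketch below fits a `OneHotEncoder` on one made-up frame and applies it to another that contains an unseen team. The `train` and `new` frames are hypothetical, and `handle_unknown="ignore"` is just one reasonable setting, not necessarily the one you would want in every workflow.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/new split, not from the NBA dataset
train = pd.DataFrame({"Tm": ["ATL", "BOS", "DEN", "BOS"]})
new = pd.DataFrame({"Tm": ["BOS", "SEA"]})  # SEA was never seen in training

# handle_unknown="ignore" encodes unseen categories as all zeros
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["Tm"]])

# The default output is a sparse matrix; .toarray() makes it dense
encoded_new = encoder.transform(new[["Tm"]]).toarray()

print(list(encoder.categories_[0]))  # categories learned during fit
print(encoded_new)                   # one column per training category

# pd.get_dummies only sees the data it is given, so the columns differ
dummy_cols = list(pd.get_dummies(new, columns=["Tm"]).columns)
print(dummy_cols)  # ['Tm_BOS', 'Tm_SEA']
```

The encoder always returns the three columns it learned from `train`, while `pd.get_dummies` on `new` produces a column for `SEA` and none for `ATL` or `DEN`.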

### Binary Encoding

Create a binary column named `is_old` that shows whether or not the `Year` variable is before 1980.

In [7]:
# True when the season is before 1980
nba_sub["is_old"] = nba_sub["Year"] < 1980


Create a binary column named `in_cali` that shows whether or not a team is located in California.

In [8]:
ca_teams = ["LAL", "LAC", "GSW", "SAC"]
nba["in_cali"] = nba["Tm"].isin(ca_teams)
nba["in_cali"]

0        False
1        False
2        False
3        False
4        False
         ...  
24686    False
24687    False
24688    False
24689    False
24690     True
Name: in_cali, Length: 24616, dtype: bool


### Ordinal Encoding

Let's make up some data to be ordinal encoded.

* Using the `grades` list create a sample of 20 random letters
* Create a 1 column DataFrame from this sample

In [9]:
np.random.seed(42)

grades = ["A", "B", "C", "D", "F"]
rand_grades = np.random.choice(grades, size=20)

grade_df = pd.DataFrame({"grade": rand_grades})
grade_df.head()

Unnamed: 0,grade
0,D
1,F
2,C
3,F
4,F



Create a variable that is an ordinal encoding of grade.  Have `A` be 1 and `F` be 5.

In [10]:
# Map each letter to its rank; .map() looks every value up in the dict
grade_map = {"A": 1, "B": 2, "C": 3, "D": 4, "F": 5}
grade_df["grade_num"] = grade_df["grade"].map(grade_map)
grade_df.head()

Unnamed: 0,grade,grade_num
0,D,4
1,F,5
2,C,3
3,F,5
4,F,5
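
An alternative to a hand-written mapping is an ordered categorical, which stores the ranking alongside the data. The sketch below is only an illustration: the `sample` frame is made up and is not the random sample drawn above.

```python
import pandas as pd

grades = ["A", "B", "C", "D", "F"]

# Hypothetical sample, not the random sample drawn above
sample = pd.DataFrame({"grade": ["D", "F", "C", "A", "B"]})

# An ordered categorical records the ranking A < B < C < D < F explicitly
ordered = pd.Categorical(sample["grade"], categories=grades, ordered=True)

# .codes are 0-based positions in `categories`, so add 1 to get A=1 ... F=5
sample["grade_num"] = ordered.codes + 1
print(sample["grade_num"].tolist())  # [4, 5, 3, 1, 2]
```

One thing to watch for: a value that is not in `categories` gets the code -1, so it would show up as 0 after adding 1.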



### Scaling

Some methods we'll see are sensitive to variables being on different scales.  For example, if you have variables for a person's height and their annual income, the height values will be much smaller than the income values.  In some methods, this makes the income variable a louder signal than the height variable.  Larger-magnitude variables can end up drowning out smaller-magnitude ones, which is a problem if we think height is an important predictor.

To address this issue, we can scale the variables to put them on equal footing.  Scaling won't change the shape of their distributions, so the patterns within and between the variables are preserved; only the units of the values have been standardized.
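
The claim that scaling keeps the shape can be checked numerically. The sketch below uses made-up `height` and `income` arrays (assumptions for illustration only) and `StandardScaler`; the same reasoning applies to any scaler that only shifts and rescales each column, such as `MinMaxScaler`.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

# Hypothetical skewed data standing in for two features on different scales
rng = np.random.default_rng(42)
income = rng.lognormal(mean=11, sigma=0.5, size=500)
height = 170 + 0.0001 * income + rng.normal(0, 8, size=500)
X = np.column_stack([height, income])

X_scaled = StandardScaler().fit_transform(X)

# Skewness (one measure of distribution shape) is unchanged by scaling
print(stats.skew(X, axis=0), stats.skew(X_scaled, axis=0))

# The correlation between the two features is unchanged too
corr_raw = np.corrcoef(X, rowvar=False)[0, 1]
corr_scaled = np.corrcoef(X_scaled, rowvar=False)[0, 1]
print(corr_raw, corr_scaled)

# Only the units change: each column now has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```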

* Create a subset of the nba dataset that has the columns `PTS` and `Age`.
* Drop all NAs
* Use the pandas boxplot method on this resulting data.
* Plot these variables on a scatter plot.

We're going to split into groups to evaluate 2 different scalers.  The below code will decide the groups.

In [11]:
# fmt: off
data_scientists = ['Anthony', 'Dillan', 'Gaukhar', 'Harinder', 'James',
                   'Josh', 'Leon', 'Mason', 'Rachel', 'Steve']
# fmt: on

# Randomize order
np.random.shuffle(data_scientists)

n = len(data_scientists) // 2
print(f"Use StandardScaler: {data_scientists[:n]}")
print(f"Use MinMaxScaler: {data_scientists[n:]}")

Use StandardScaler: ['Dillan', 'Mason', 'Leon', 'Gaukhar', 'Rachel']
Use MinMaxScaler: ['Anthony', 'Harinder', 'James', 'Josh', 'Steve']



In [12]:
# Pick your poison (comment out the one that your group isn't doing)
scaler = StandardScaler()
# scaler = MinMaxScaler()


* Use a scaler to scale the `PTS` and `Age` data.
* The output of the scaler is a numpy array; convert it back to a dataframe.
* Recreate the same box plots from before.
  * What's the same?
  * What's different?
  * What's the minimum value of the numeric axis? the max value?

In [17]:
# .fit() methods 'learn' something from your data
# They don't apply any of these learnings
# In the case of a scaler we have to call .transform
# Alternatively, we could use .fit_transform() to do
# both of these things in one step

# Scalers only accept numeric data, so keep just PTS and Age
# (fitting on nba_sub fails because of the string column Tm)
pts_age = nba[["PTS", "Age"]].dropna()

scaler.fit(pts_age)
scaled = scaler.transform(pts_age)

# Convert the numpy array back to a dataframe
scaled_df = pd.DataFrame(scaled, columns=pts_age.columns, index=pts_age.index)


* Bonus: what attributes does your scaler have? What is the significance of these?
* Bonus Bonus: can you recreate this same scaling from scratch?
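
One possible take on the two bonus questions is sketched below. It uses a few `PTS` and `Age` values copied from the sample output earlier in the notebook as a small stand-in array `X` (an assumption, to keep the example self-contained), inspects the attributes each scaler learns in `.fit()`, and then reproduces both transformations with plain numpy.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Small stand-in for the PTS and Age columns
X = np.array([[660.0, 24.0], [2.0, 23.0], [1384.0, 21.0], [1465.0, 27.0]])

std_scaler = StandardScaler().fit(X)
mm_scaler = MinMaxScaler().fit(X)

# Learned attributes: per-column statistics stored by .fit()
print(std_scaler.mean_, std_scaler.scale_)       # column means and standard deviations
print(mm_scaler.data_min_, mm_scaler.data_max_)  # column minimums and maximums

# StandardScaler from scratch: z = (x - mean) / std, using the population std (ddof=0)
z_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# MinMaxScaler from scratch: (x - min) / (max - min), mapping each column to [0, 1]
mm_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(np.allclose(z_manual, std_scaler.transform(X)))  # True
print(np.allclose(mm_manual, mm_scaler.transform(X)))  # True
```

Note that pandas `.std()` defaults to the sample standard deviation (`ddof=1`), so a from-scratch version built on a dataframe needs `.std(ddof=0)` to match `StandardScaler` exactly.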