# NBA Dataset

Let's look at this [Kaggle NBA dataset](https://www.kaggle.com/drgilermo/nba-players-stats?select=Seasons_Stats.csv), which contains game stats on every NBA player per season since 1950, and answer a few questions like:
* What decade of basketball was the highest scoring? Which was the most physical?
* How has scoring changed over time among the Center position?
* How do the stats of some of the 'greatest of all time' players compare?

# Module Two

Import necessary modules

In [2]:
import pandas as pd
import numpy as np
from datetime import datetime
import altair as alt

In [3]:
nba = pd.read_csv("Seasons_Stats.csv")
nba.tail()

Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
24686,24686,2017.0,Cody Zeller,PF,24.0,CHO,62.0,58.0,1725.0,16.7,...,0.679,135.0,270.0,405.0,99.0,62.0,58.0,65.0,189.0,639.0
24687,24687,2017.0,Tyler Zeller,C,27.0,BOS,51.0,5.0,525.0,13.0,...,0.564,43.0,81.0,124.0,42.0,7.0,21.0,20.0,61.0,178.0
24688,24688,2017.0,Stephen Zimmerman,C,20.0,ORL,19.0,0.0,108.0,7.3,...,0.6,11.0,24.0,35.0,4.0,2.0,5.0,3.0,17.0,23.0
24689,24689,2017.0,Paul Zipser,SF,22.0,CHI,44.0,18.0,843.0,6.9,...,0.775,15.0,110.0,125.0,36.0,15.0,16.0,40.0,78.0,240.0
24690,24690,2017.0,Ivica Zubac,C,19.0,LAL,38.0,11.0,609.0,17.0,...,0.653,41.0,118.0,159.0,30.0,14.0,33.0,30.0,66.0,284.0


Here we see that the data contains the year, the player, their position, age, team, and then a number of in-game statistics like points, assists, etc. It is important to note that each player appears in this dataset once for every season they played. For example, if Michael Jordan played 15 seasons, then he has 15 entries in this dataset, one per season.
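To make the one-row-per-player-per-season layout concrete, here is a small sketch on a made-up frame (`demo` and its values are hypothetical, not taken from the dataset) that counts how many seasons each player appears in:

```python
import pandas as pd

# Hypothetical rows: player A has three seasons, player B has one.
demo = pd.DataFrame({
    "Player": ["A", "A", "A", "B"],
    "Year": [2015.0, 2016.0, 2017.0, 2017.0],
})

# Number of distinct seasons per player.
seasons_played = demo.groupby("Player")["Year"].nunique()
```

The same `groupby` applied to `nba` should give the number of seasons on record for each player.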

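Fix the Year column. It is read in as a float (for example 2017.0), so convert it to a string and strip the trailing ".0". Missing years (NaN) become the string "n" after this step.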

In [4]:
nba["Year"] = nba["Year"].apply(lambda x: str(x)[:-2])
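The slicing above relies on `str(2017.0)` being `"2017.0"` and on `str(nan)[:-2]` being `"n"`. As a minimal alternative sketch, using a tiny made-up frame (`demo` is hypothetical, not the real dataset), one could drop the missing years first and cast the rest to integers:

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the real data: one Year is missing.
demo = pd.DataFrame({"Year": [1950.0, np.nan, 2017.0],
                     "Player": ["A", "B", "C"]})

# What the slicing approach produces: "1950", "n", "2017".
sliced = demo["Year"].apply(lambda x: str(x)[:-2]).tolist()

# Alternative: drop rows with no Year, then cast the floats to int.
clean = demo.dropna(subset=["Year"]).copy()
clean["Year"] = clean["Year"].astype(int)
years = clean["Year"].tolist()
```

The rest of this notebook expects `Year` to be a string, so this only illustrates the alternative rather than replacing the cell above.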

## Question 1: What decade of basketball was the highest scoring? Which was the most physical?

In [5]:
def get_decade(string):
    decade = string[2] + '0'
    return decade

Drop the rows whose Year was missing. After the string conversion above, they hold the value "n".

In [6]:
nba = nba[nba.Year != "n"]

In [7]:
nba["Decade"] = nba["Year"].apply(get_decade)
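`get_decade` keeps only the third character of the year, so 2005 becomes "00" and 2015 becomes "10", and as strings these sort ahead of "50". As a sketch of an alternative (the helper name `decade_of` is an assumption, not part of the notebook), integer arithmetic gives labels that sort chronologically:

```python
def decade_of(year):
    """Return the first year of the decade, e.g. "1987" -> 1980."""
    return (int(year) // 10) * 10

# Years are strings here, matching the notebook's Year column.
labels = [decade_of(y) for y in ["1950", "1987", "2005", "2017"]]
ordered = sorted(labels)
```

With this helper, a grouped table would list 1950 first and 2010 last instead of starting with the 2000s.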

Groupby decade with aggregation functions on PTS, Personal Fouls, etc.

In [8]:
groupby_decade = nba.groupby("Decade").agg({"PTS": "sum", "3P": "sum", "Age": "mean", "AST": "sum", "FG%": "mean",
                                           "MP": "mean", "FTA": "sum", "TRB": "sum", "BLK": "sum", "PF": "sum"})

In [9]:
groupby_decade

Decade,PTS,3P,Age,AST,FG%,MP,FTA,TRB,BLK,PF
0,2532236.0,146743.0,27.117788,564536.0,0.42731,1166.032525,652351.0,1095502.0,127719.0,577879.0
10,2110087.0,160314.0,26.606888,462086.0,0.434789,1072.214826,488152.0,892202.0,101875.0,429262.0
50,651403.0,0.0,26.056172,151177.0,0.343325,1307.028008,254687.0,311303.0,0.0,194143.0
60,953118.0,0.0,25.944882,187421.0,0.40745,1437.918397,299991.0,495892.0,0.0,218128.0
70,1757415.0,0.0,26.169825,396481.0,0.437611,1381.831579,461570.0,794310.0,51408.0,403031.0
80,2199111.0,20474.0,26.274541,518623.0,0.4574,1282.531496,586353.0,883683.0,108151.0,507583.0
90,2357735.0,92933.0,27.190311,547807.0,0.438846,1151.105435,624849.0,988492.0,119755.0,534711.0


In [10]:
groupby_decade.reset_index(inplace=True)

In [11]:
groupby_decade.fillna(0)

Unnamed: 0,Decade,PTS,3P,Age,AST,FG%,MP,FTA,TRB,BLK,PF
0,0,2532236.0,146743.0,27.117788,564536.0,0.42731,1166.032525,652351.0,1095502.0,127719.0,577879.0
1,10,2110087.0,160314.0,26.606888,462086.0,0.434789,1072.214826,488152.0,892202.0,101875.0,429262.0
2,50,651403.0,0.0,26.056172,151177.0,0.343325,1307.028008,254687.0,311303.0,0.0,194143.0
3,60,953118.0,0.0,25.944882,187421.0,0.40745,1437.918397,299991.0,495892.0,0.0,218128.0
4,70,1757415.0,0.0,26.169825,396481.0,0.437611,1381.831579,461570.0,794310.0,51408.0,403031.0
5,80,2199111.0,20474.0,26.274541,518623.0,0.4574,1282.531496,586353.0,883683.0,108151.0,507583.0
6,90,2357735.0,92933.0,27.190311,547807.0,0.438846,1151.105435,624849.0,988492.0,119755.0,534711.0


Plot

In [12]:
alt.Chart(groupby_decade).mark_circle().encode(
    x="PTS",
    y="PF",
    color=alt.Color('AST', scale=alt.Scale(scheme='spectral')),
    size="FTA",
    tooltip=["Decade","PTS","PF","AST","FTA"]
)

This scatterplot shows us that (surprisingly) there tends to be a linear relationship between points scored and personal fouls committed. This may be because more fouls in a given season lead to more easy free throws, thus increasing points.
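One way to put a number on that linear relationship is a Pearson correlation. The sketch below copies the decade totals for PTS and PF from the grouped table above; the decade labels in the index are added only for readability:

```python
import pandas as pd

# Decade totals copied from the grouped table above.
totals = pd.DataFrame(
    {
        "PTS": [2532236, 2110087, 651403, 953118, 1757415, 2199111, 2357735],
        "PF": [577879, 429262, 194143, 218128, 403031, 507583, 534711],
    },
    index=["2000s", "2010s", "1950s", "1960s", "1970s", "1980s", "1990s"],
)

# Pearson correlation between total points and total personal fouls.
r = totals["PTS"].corr(totals["PF"])
print(round(r, 3))
```

A value close to 1 supports the linear reading, though with only seven points, and with both totals growing as more games are played, much of the relationship may simply reflect volume.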

Additionally, we find that the 2000s were both the most physical and the highest-scoring decade, followed by the 90s and 80s. I would have predicted the 2010s to be the highest scoring because of the rise of the 3-point shot, but the data says otherwise.

Also, a limitation of the data is that fewer games were played in the 50s and 60s than in later years, so the totals for those decades are skewed lower. The 2010s are incomplete as well, since the data ends with the 2017 season.
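One possible way to reduce that skew is to divide decade totals by games played (the `G` column) rather than comparing raw sums. This is a sketch on made-up numbers (`demo` is hypothetical), not a result from the dataset:

```python
import pandas as pd

# Hypothetical player-season rows; G is games played, as in the dataset.
demo = pd.DataFrame({
    "Decade": ["50", "50", "00", "00"],
    "G":      [60, 40, 82, 78],
    "PTS":    [600, 400, 1640, 780],
    "PF":     [180, 120, 164, 156],
})

sums = demo.groupby("Decade")[["G", "PTS", "PF"]].sum()

# Divide decade totals by total player-games so decades with
# fewer games are comparable with later ones.
per_game = sums[["PTS", "PF"]].div(sums["G"], axis=0)
```

Here `per_game` holds points and fouls per player-game for each decade, which does not depend on how many games a decade contains.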

## Question 2: How has scoring changed over time among the Center position?

Groupby position with aggregations on points and 3 Pointers made

In [13]:
pos_gb = nba.groupby(["Pos","Year"]).agg({"PTS": "mean", "3P": "mean"})

In [14]:
pos_gb.reset_index(inplace=True)

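Convert the Year column from a string to a number so we can easily graph it over time. The expression below builds a datetime and then takes its `.year`, so the result is an integer year rather than a datetime.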

In [15]:
pos_gb["Year"] = pos_gb["Year"].apply(lambda x: datetime(int(x), 5, 17).year)

Focus on just centers

In [16]:
centers = pos_gb[pos_gb["Pos"]=="C"] 

In [17]:
c1 = alt.Chart(centers, title="Average Points Scored by Centers Per Season").mark_line().encode(
    x='Year',
    y='PTS'
)

c2 = alt.Chart(centers, title="Average 3 Pointers Scored by Centers Per Season").mark_line().encode(
    x='Year',
    y='3P'
)

c1 | c2

This visualization first shows us that scoring by centers in the NBA has decreased greatly. Back in the 60s the center was the most dominant position. Now, however, a center's average points per season is almost half of its peak in the 60s. This is likely due to a number of things, like the rise of the 3-point shot (which centers do not thrive on) and the rise of "small ball" (using shorter lineups).

The visualization on the right shows us that even though average scoring among centers has sharply declined, their average 3-pointers made has sharply risen. Especially in the past 10 years, centers have started taking and making dramatically more 3-pointers. This reflects the shift in NBA basketball toward prioritizing 3-point shots over contested 2-point shots. This graph clearly shows how the center position has adapted to modern NBA play.
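The "almost cut in half" reading of the left chart can be checked numerically by comparing the latest season with the peak season. The sketch below uses made-up yearly averages (`centers_demo` is hypothetical); the same steps should apply to the `centers` frame:

```python
import pandas as pd

# Hypothetical yearly averages for centers (not real values).
centers_demo = pd.DataFrame({
    "Year": [1962, 1985, 2016],
    "PTS":  [800.0, 600.0, 440.0],
})

# Season with the highest average, and the most recent season.
peak_row = centers_demo.loc[centers_demo["PTS"].idxmax()]
latest_row = centers_demo.sort_values("Year").iloc[-1]

# Fraction of the peak that the most recent season retains.
ratio = latest_row["PTS"] / peak_row["PTS"]
```

A ratio near 0.5 would match the description above.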

## Question 3: How do the stats of some of the 'greatest of all time' players compare?

Let's compare three of the greatest players of all time:
* LeBron James
* Michael Jordan
* Kobe Bryant

To visualize these players we will give them a simple artificial aggregate score that combines their total points, rebounds, assists, steals, and blocks for each season.

In [18]:
lbj = nba[nba["Player"]=="LeBron James"].copy()  # explicit copy avoids SettingWithCopyWarning
lbj["Season"] = list(range(1, 15))

In [19]:
lbj["Total Year Score"] = lbj["PTS"] + lbj["TRB"] + lbj["AST"] + lbj["STL"] + lbj["BLK"]
lbj.head()


Unnamed: 0.1,Unnamed: 0,Year,Player,Pos,Age,Tm,G,GS,MP,PER,...,TRB,AST,STL,BLK,TOV,PF,PTS,Decade,Season,Total Year Score
16746,16746,2004,LeBron James,SG,19.0,CLE,79.0,79.0,3122.0,18.3,...,432.0,465.0,130.0,58.0,273.0,149.0,1654.0,0,1,2739.0
17344,17344,2005,LeBron James,SF,20.0,CLE,80.0,80.0,3388.0,25.7,...,588.0,577.0,177.0,52.0,262.0,146.0,2175.0,0,2,3569.0
17918,17918,2006,LeBron James,SF,21.0,CLE,79.0,79.0,3361.0,28.1,...,556.0,521.0,123.0,66.0,260.0,181.0,2478.0,0,3,3744.0
18460,18460,2007,LeBron James,SF,22.0,CLE,78.0,78.0,3190.0,24.5,...,526.0,470.0,125.0,55.0,250.0,171.0,2132.0,0,4,3308.0
19017,19017,2008,LeBron James,SF,23.0,CLE,75.0,74.0,3027.0,29.1,...,592.0,539.0,138.0,81.0,255.0,165.0,2250.0,0,5,3600.0


Repeat for Michael Jordan and Kobe Bryant

Also, we must only look at each player's first 14 seasons, because at the time this data was collected LeBron had only played 14 seasons (the fewest of the three).

In [20]:
mj = nba[nba["Player"] == "Michael Jordan*"].copy()  # the dataset stores his name with a trailing asterisk
mj["Total Year Score"] = mj["PTS"] + mj["TRB"] + mj["AST"] + mj["STL"] + mj["BLK"]
mj["Season"] = list(range(1, 16))
mj = mj.head(14)  # keep only the first 14 seasons


In [21]:
kobe = nba[nba["Player"] == "Kobe Bryant"].copy()  # explicit copy avoids SettingWithCopyWarning
kobe["Total Year Score"] = kobe["PTS"] + kobe["TRB"] + kobe["AST"] + kobe["STL"] + kobe["BLK"]
kobe["Season"] = list(range(1, 21))
kobe = kobe.head(14)  # keep only the first 14 seasons


In [22]:
all3 = pd.concat([lbj, mj, kobe])
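The three per-player cells above repeat the same steps, so they could be folded into one helper. This is only a sketch: the name `career_scores`, the `STAT_COLS` list, and the `demo` frame are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

STAT_COLS = ["PTS", "TRB", "AST", "STL", "BLK"]

def career_scores(df, player, n_seasons=14):
    """Return the player's first n_seasons with Season and Total Year Score."""
    rows = df[df["Player"] == player].sort_values("Year").copy()
    rows["Season"] = list(range(1, len(rows) + 1))
    # Note: sum(axis=1) treats a missing stat as 0, whereas chaining
    # + across the columns would give NaN for that season.
    rows["Total Year Score"] = rows[STAT_COLS].sum(axis=1)
    return rows.head(n_seasons)

# Hypothetical two-season player to show the call.
demo = pd.DataFrame({
    "Player": ["X", "X"], "Year": ["2001", "2000"],
    "PTS": [100, 50], "TRB": [10, 5], "AST": [10, 5],
    "STL": [1, 1], "BLK": [1, 1],
})
out = career_scores(demo, "X", n_seasons=1)
```

Under these assumptions, `pd.concat([career_scores(nba, p) for p in ["LeBron James", "Michael Jordan*", "Kobe Bryant"]])` would build the same kind of combined frame as `all3`.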

In [23]:
alt.Chart(all3, title="Stats: LeBron vs MJ vs Kobe").mark_bar().encode(
    x='Season',
    y='Total Year Score',
    color='Player',
    tooltip= ["Player","Total Year Score"]
)

According to this visualization, Michael Jordan performed the best of the three at the beginning of their careers, and he also had the highest peak. Kobe, by contrast, did not do as well toward the beginning of his career but performed better toward the end of this 14-season window. LeBron, surprisingly, only "won" two seasons: the second and the fourteenth. However, one could predict that if this data included the past few seasons of LeBron's career, he would have "won" those later seasons as well, given his outstanding performance late in his career.
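Which player "won" each season can also be read off programmatically instead of from the chart. The sketch below uses made-up scores (`demo` is hypothetical); applied to `all3`, the same steps should list the top scorer per season:

```python
import pandas as pd

# Hypothetical scores for two players over two seasons.
demo = pd.DataFrame({
    "Player": ["A", "B", "A", "B"],
    "Season": [1, 1, 2, 2],
    "Total Year Score": [3000, 2500, 2800, 3100],
})

# Row label of the top score within each season, then look up the player.
top_idx = demo.groupby("Season")["Total Year Score"].idxmax()
winners = demo.loc[top_idx].set_index("Season")["Player"]

# How many seasons each player "won".
wins_per_player = winners.value_counts()
```

This relies on the frame having unique row labels, which holds for `all3` because it keeps the original row labels from `nba`.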

# Module Three

In [24]:
all_positions = nba.groupby(["Pos","Year"]).agg({"PTS": "mean", "AST": "mean"})

In [25]:
all_positions.reset_index(inplace=True)

In [26]:
center = all_positions[all_positions["Pos"]=="C"]
pf = all_positions[all_positions["Pos"]=="PF"]
sf = all_positions[all_positions["Pos"]=="SF"]
sg = all_positions[all_positions["Pos"]=="SG"]
pg = all_positions[all_positions["Pos"]=="PG"]

In [27]:
all_positions_df = pd.concat([pg,sg,sf,pf,center])

In [28]:
all_positions_df

Unnamed: 0,Pos,Year,PTS,AST
288,PG,1950,454.933333,163.733333
289,PG,1951,612.222222,247.222222
290,PG,1952,512.055556,205.111111
291,PG,1953,374.500000,135.343750
292,PG,1954,452.681818,186.636364
...,...,...,...,...
63,C,2013,375.080357,48.642857
64,C,2014,363.882883,49.162162
65,C,2015,440.580000,56.810000
66,C,2016,433.750000,57.423077


## Final Visualization

In [30]:
source = all_positions_df

selector = alt.selection_single(empty='all', fields=['Pos'])

color_scale = alt.Scale(domain=['PG', 'SG',"SF","PF","C"],
                        range=['#E309F9', '#098CF9','#8CF909','#F9BB09','#F93309'])

base = alt.Chart(source).properties(
    width=250,
    height=250
).add_selection(selector)

points = base.mark_point(filled=True, size=200).encode(
    x=alt.X('mean(PTS):Q',
            scale=alt.Scale(domain=[0,700])),
    y=alt.Y('mean(AST):Q',
            scale=alt.Scale(domain=[0,300])),
    color=alt.condition(selector,
                        'Pos:N',
                        alt.value('lightgray'),
                        scale=color_scale),
    tooltip = ["Pos"]
)

hists = alt.Chart(source).mark_line().encode(
    x='Year',
    y='PTS',
    color=alt.Color('Pos:N',
                    scale=color_scale),
    tooltip = ["Year","PTS"]
).transform_filter(
    selector
)


points | hists