# Sprint 4 Software Development Tools: Project

## Project Description

You are asked to develop and deploy a web application to a cloud service so that it is accessible to the public.

In this project, I use data from [www.basketball-reference.com](https://www.basketball-reference.com). The goal is to find out who has been the "best" regular-season player of the 2020s so far. To do this, I gather the per-game dataset for each season from 2020-21 through 2023-24. The 2024-25 season is excluded because it is still ongoing.

## Import Packages 

In [438]:
import pandas as pd
import lxml  # HTML parser that pd.read_html uses by default
import plotly.express as px

## 2020-2021

In [439]:
#Retrieve HTML table dataset
url_2020 = 'https://www.basketball-reference.com/leagues/NBA_2021_per_game.html'
html = pd.read_html(url_2020, header = 0)
df_2020 = html[0]

df_2020.head()

Unnamed: 0,Rk,Player,Age,Team,Pos,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Awards
0,1.0,Stephen Curry,32.0,GSW,PG,63.0,63.0,34.2,10.4,21.7,0.482,5.3,12.7,0.421,5.1,9.0,0.569,0.605,5.7,6.3,0.916,0.5,5.0,5.5,5.8,1.2,0.1,3.4,1.9,32.0,"MVP-3,AS,NBA1"
1,2.0,Bradley Beal,27.0,WAS,SG,60.0,60.0,35.8,11.2,23.0,0.485,2.2,6.2,0.349,9.0,16.8,0.535,0.532,6.8,7.7,0.889,1.2,3.5,4.7,4.4,1.2,0.4,3.1,2.3,31.3,"AS,NBA3"
2,3.0,Damian Lillard,30.0,POR,PG,67.0,67.0,35.8,9.0,19.9,0.451,4.1,10.5,0.391,4.9,9.4,0.519,0.554,6.7,7.2,0.928,0.5,3.7,4.2,7.5,0.9,0.3,3.0,1.5,28.8,"MVP-7,AS,NBA2"
3,4.0,Joel Embiid,26.0,PHI,C,51.0,51.0,31.1,9.0,17.6,0.513,1.1,3.0,0.377,7.9,14.6,0.541,0.545,9.2,10.7,0.859,2.2,8.4,10.6,2.8,1.0,1.4,3.1,2.4,28.5,"MVP-2,DPOY-7,AS,NBA2"
4,5.0,Giannis Antetokounmpo,26.0,MIL,PF,61.0,61.0,33.0,10.3,18.0,0.569,1.1,3.6,0.303,9.2,14.4,0.636,0.6,6.5,9.5,0.685,1.6,9.4,11.0,5.9,1.2,1.2,3.4,2.8,28.1,"MVP-4,DPOY-5,AS,NBA1"


### Data Cleaning

In [440]:
# Dimensions of the dataframe
df_2020.shape


(706, 31)

In [441]:
# Check for missing values
df_2020.isnull().sum()

Rk          1
Player      0
Age         1
Team        1
Pos         1
G           1
GS          1
MP          1
FG          1
FGA         1
FG%         2
3P          1
3PA         1
3P%        35
2P          1
2PA         1
2P%         6
eFG%        2
FT          1
FTA         1
FT%        29
ORB         1
DRB         1
TRB         1
AST         1
STL         1
BLK         1
TOV         1
PF          1
PTS         1
Awards    651
dtype: int64
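
The counts above suggest two kinds of gaps: one row that is empty in nearly every column (most likely a summary row such as a league-average line, which is an assumption about the source table) and percentage columns that are blank for players with zero attempts. The sketch below uses made-up numbers, not values from the dataset, to show how filling a blank percentage with 0 changes a later average, since pandas skips NaN in `mean()` by default.

```python
import numpy as np
import pandas as pd

# Hypothetical free-throw percentages for one player across three seasons.
# NaN marks a season with zero attempts, where no percentage can be computed.
ft_pct = pd.Series([0.800, np.nan, 0.900])

mean_skipping_nan = ft_pct.mean()          # NaN is ignored by default
mean_after_fill = ft_pct.fillna(0).mean()  # the blank season now counts as 0%

print(round(mean_skipping_nan, 3))
print(round(mean_after_fill, 3))
```

Whether a blank should count as 0% or be left out of the average is a judgment call; the notebook fills with 0.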

In [442]:
# Fill missing values with 0
df_2020 = df_2020.fillna(0)
df_2020.isnull().sum()

Rk        0
Player    0
Age       0
Team      0
Pos       0
G         0
GS        0
MP        0
FG        0
FGA       0
FG%       0
3P        0
3PA       0
3P%       0
2P        0
2PA       0
2P%       0
eFG%      0
FT        0
FTA       0
FT%       0
ORB       0
DRB       0
TRB       0
AST       0
STL       0
BLK       0
TOV       0
PF        0
PTS       0
Awards    0
dtype: int64

In [443]:
# Drop columns that are not required
df_2020 = df_2020.drop(columns= ['GS', 'MP', 'Rk', 'Awards', 'PF', '2P', '2PA', 'FT', 'FTA', 'eFG%'])
# Remove repeated header rows (.copy() avoids a SettingWithCopyWarning when adding the Season column)
df_2020 = df_2020[df_2020["Player"] != "Player"].copy()

# Add Season column
df_2020["Season"] = "2020-21"

df_2020.head()

Unnamed: 0,Player,Age,Team,Pos,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PTS,Season
0,Stephen Curry,32.0,GSW,PG,63.0,10.4,21.7,0.482,5.3,12.7,0.421,0.569,0.916,0.5,5.0,5.5,5.8,1.2,0.1,3.4,32.0,2020-21
1,Bradley Beal,27.0,WAS,SG,60.0,11.2,23.0,0.485,2.2,6.2,0.349,0.535,0.889,1.2,3.5,4.7,4.4,1.2,0.4,3.1,31.3,2020-21
2,Damian Lillard,30.0,POR,PG,67.0,9.0,19.9,0.451,4.1,10.5,0.391,0.519,0.928,0.5,3.7,4.2,7.5,0.9,0.3,3.0,28.8,2020-21
3,Joel Embiid,26.0,PHI,C,51.0,9.0,17.6,0.513,1.1,3.0,0.377,0.541,0.859,2.2,8.4,10.6,2.8,1.0,1.4,3.1,28.5,2020-21
4,Giannis Antetokounmpo,26.0,MIL,PF,61.0,10.3,18.0,0.569,1.1,3.6,0.303,0.636,0.685,1.6,9.4,11.0,5.9,1.2,1.2,3.4,28.1,2020-21


### Write to CSV File

In [444]:
df_2020.to_csv('nba2020.csv', index = False)

In [445]:
df_2020 = pd.read_csv('nba2020.csv')
#Display all columns
pd.set_option('display.max_columns', None)
df_2020.head()

Unnamed: 0,Player,Age,Team,Pos,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PTS,Season
0,Stephen Curry,32.0,GSW,PG,63.0,10.4,21.7,0.482,5.3,12.7,0.421,0.569,0.916,0.5,5.0,5.5,5.8,1.2,0.1,3.4,32.0,2020-21
1,Bradley Beal,27.0,WAS,SG,60.0,11.2,23.0,0.485,2.2,6.2,0.349,0.535,0.889,1.2,3.5,4.7,4.4,1.2,0.4,3.1,31.3,2020-21
2,Damian Lillard,30.0,POR,PG,67.0,9.0,19.9,0.451,4.1,10.5,0.391,0.519,0.928,0.5,3.7,4.2,7.5,0.9,0.3,3.0,28.8,2020-21
3,Joel Embiid,26.0,PHI,C,51.0,9.0,17.6,0.513,1.1,3.0,0.377,0.541,0.859,2.2,8.4,10.6,2.8,1.0,1.4,3.1,28.5,2020-21
4,Giannis Antetokounmpo,26.0,MIL,PF,61.0,10.3,18.0,0.569,1.1,3.6,0.303,0.636,0.685,1.6,9.4,11.0,5.9,1.2,1.2,3.4,28.1,2020-21
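
The retrieval, cleaning, and labeling steps in this section are repeated unchanged for each later season. As a sketch only, they could be wrapped in small helpers. The names `season_url`, `season_label`, and `clean_season` are hypothetical and not part of the original notebook, and the toy table stands in for the scraped data so the example runs offline.

```python
import pandas as pd

# Columns the notebook drops for every season.
DROP_COLS = ['GS', 'MP', 'Rk', 'Awards', 'PF', '2P', '2PA', 'FT', 'FTA', 'eFG%']


def season_url(end_year: int) -> str:
    """Per-game stats page for the season ending in `end_year`."""
    return f'https://www.basketball-reference.com/leagues/NBA_{end_year}_per_game.html'


def season_label(end_year: int) -> str:
    """Label such as '2020-21' for the season ending in 2021."""
    return f'{end_year - 1}-{str(end_year)[-2:]}'


def clean_season(raw: pd.DataFrame, end_year: int) -> pd.DataFrame:
    """Apply the same cleaning steps as the cells above to one raw table."""
    df = raw.fillna(0)
    df = df.drop(columns=[c for c in DROP_COLS if c in df.columns])
    # .copy() makes the filtered result independent, so adding a column is safe
    df = df[df['Player'] != 'Player'].copy()
    df['Season'] = season_label(end_year)
    return df


# Toy stand-in for a scraped table, including one repeated header row.
raw = pd.DataFrame({
    'Rk': [1.0, None],
    'Player': ['Stephen Curry', 'Player'],
    'PTS': [32.0, None],
    'GS': [63.0, None],
})
cleaned = clean_season(raw, 2021)
print(cleaned)

# Hypothetical usage against the live site (needs network access):
# frames = [clean_season(pd.read_html(season_url(y), header=0)[0], y)
#           for y in range(2021, 2025)]
# combined_df = pd.concat(frames, ignore_index=True)
```

Keeping the per-season cells, as the notebook does, makes each intermediate table easy to inspect; the helpers mainly reduce the chance of the four copies drifting apart.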


## 2021-2022

In [446]:
#Retrieve HTML table dataset
url_2021 = 'https://www.basketball-reference.com/leagues/NBA_2022_per_game.html'
html = pd.read_html(url_2021, header = 0)
df_2021 = html[0]

df_2021.head()

Unnamed: 0,Rk,Player,Age,Team,Pos,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Awards
0,1.0,Joel Embiid,27.0,PHI,C,68.0,68.0,33.8,9.8,19.6,0.499,1.4,3.7,0.371,8.4,15.9,0.529,0.534,9.6,11.8,0.814,2.1,9.6,11.7,4.2,1.1,1.5,3.1,2.7,30.6,"MVP-2,AS,NBA2"
1,2.0,LeBron James,37.0,LAL,C,56.0,56.0,37.2,11.4,21.8,0.524,2.9,8.0,0.359,8.6,13.8,0.62,0.59,4.5,6.0,0.756,1.1,7.1,8.2,6.2,1.3,1.1,3.5,2.2,30.3,"MVP-10,AS,NBA3"
2,3.0,Giannis Antetokounmpo,27.0,MIL,PF,67.0,67.0,32.9,10.3,18.6,0.553,1.1,3.6,0.293,9.2,15.0,0.616,0.582,8.3,11.4,0.722,2.0,9.6,11.6,5.8,1.1,1.4,3.3,3.2,29.9,"MVP-3,DPOY-6,AS,NBA1"
3,4.0,Kevin Durant,33.0,BRK,PF,55.0,55.0,37.2,10.5,20.3,0.518,2.1,5.5,0.383,8.4,14.8,0.568,0.57,6.8,7.4,0.91,0.5,6.9,7.4,6.4,0.9,0.9,3.5,2.1,29.9,"MVP-10,AS,NBA2"
4,5.0,Luka Dončić,22.0,DAL,PG,65.0,65.0,35.4,9.9,21.6,0.457,3.1,8.8,0.353,6.8,12.8,0.528,0.529,5.6,7.5,0.744,0.9,8.3,9.1,8.7,1.2,0.6,4.5,2.2,28.4,"MVP-5,AS,NBA1"


### Data Cleaning

In [447]:
# Check for missing values
df_2021.isnull().sum()

Rk          1
Player      0
Age         1
Team        1
Pos         1
G           1
GS          1
MP          1
FG          1
FGA         1
FG%        15
3P          1
3PA         1
3P%        72
2P          1
2PA         1
2P%        28
eFG%       15
FT          1
FTA         1
FT%        97
ORB         1
DRB         1
TRB         1
AST         1
STL         1
BLK         1
TOV         1
PF          1
PTS         1
Awards    758
dtype: int64

In [448]:
# Fill missing values with 0
df_2021 = df_2021.fillna(0)
df_2021.isnull().sum()

Rk        0
Player    0
Age       0
Team      0
Pos       0
G         0
GS        0
MP        0
FG        0
FGA       0
FG%       0
3P        0
3PA       0
3P%       0
2P        0
2PA       0
2P%       0
eFG%      0
FT        0
FTA       0
FT%       0
ORB       0
DRB       0
TRB       0
AST       0
STL       0
BLK       0
TOV       0
PF        0
PTS       0
Awards    0
dtype: int64

In [449]:
# Drop columns that are not required
df_2021 = df_2021.drop(columns= ['GS', 'MP', 'Rk', 'Awards', 'PF', '2P', '2PA', 'FT', 'FTA', 'eFG%'])
# Remove repeated header rows (.copy() avoids a SettingWithCopyWarning when adding the Season column)
df_2021 = df_2021[df_2021["Player"] != "Player"].copy()

# Add Season column
df_2021["Season"] = "2021-22"

df_2021.head()

Unnamed: 0,Player,Age,Team,Pos,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PTS,Season
0,Joel Embiid,27.0,PHI,C,68.0,9.8,19.6,0.499,1.4,3.7,0.371,0.529,0.814,2.1,9.6,11.7,4.2,1.1,1.5,3.1,30.6,2021-22
1,LeBron James,37.0,LAL,C,56.0,11.4,21.8,0.524,2.9,8.0,0.359,0.62,0.756,1.1,7.1,8.2,6.2,1.3,1.1,3.5,30.3,2021-22
2,Giannis Antetokounmpo,27.0,MIL,PF,67.0,10.3,18.6,0.553,1.1,3.6,0.293,0.616,0.722,2.0,9.6,11.6,5.8,1.1,1.4,3.3,29.9,2021-22
3,Kevin Durant,33.0,BRK,PF,55.0,10.5,20.3,0.518,2.1,5.5,0.383,0.568,0.91,0.5,6.9,7.4,6.4,0.9,0.9,3.5,29.9,2021-22
4,Luka Dončić,22.0,DAL,PG,65.0,9.9,21.6,0.457,3.1,8.8,0.353,0.528,0.744,0.9,8.3,9.1,8.7,1.2,0.6,4.5,28.4,2021-22


### Write to CSV File

In [450]:
df_2021.to_csv('nba2021.csv', index = False)

In [451]:
df_2021 = pd.read_csv('nba2021.csv')
#Display all columns
pd.set_option('display.max_columns', None)
df_2021.head()

Unnamed: 0,Player,Age,Team,Pos,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PTS,Season
0,Joel Embiid,27.0,PHI,C,68.0,9.8,19.6,0.499,1.4,3.7,0.371,0.529,0.814,2.1,9.6,11.7,4.2,1.1,1.5,3.1,30.6,2021-22
1,LeBron James,37.0,LAL,C,56.0,11.4,21.8,0.524,2.9,8.0,0.359,0.62,0.756,1.1,7.1,8.2,6.2,1.3,1.1,3.5,30.3,2021-22
2,Giannis Antetokounmpo,27.0,MIL,PF,67.0,10.3,18.6,0.553,1.1,3.6,0.293,0.616,0.722,2.0,9.6,11.6,5.8,1.1,1.4,3.3,29.9,2021-22
3,Kevin Durant,33.0,BRK,PF,55.0,10.5,20.3,0.518,2.1,5.5,0.383,0.568,0.91,0.5,6.9,7.4,6.4,0.9,0.9,3.5,29.9,2021-22
4,Luka Dončić,22.0,DAL,PG,65.0,9.9,21.6,0.457,3.1,8.8,0.353,0.528,0.744,0.9,8.3,9.1,8.7,1.2,0.6,4.5,28.4,2021-22


## 2022-2023


In [452]:
#Retrieve HTML table dataset
url_2022 = 'https://www.basketball-reference.com/leagues/NBA_2023_per_game.html'
html = pd.read_html(url_2022, header = 0)
df_2022 = html[0]

df_2022.head()

Unnamed: 0,Rk,Player,Age,Team,Pos,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Awards
0,1.0,Joel Embiid,28.0,PHI,C,66.0,66.0,34.6,11.0,20.1,0.548,1.0,3.0,0.33,10.0,17.1,0.587,0.573,10.0,11.7,0.857,1.7,8.4,10.2,4.2,1.0,1.7,3.4,3.1,33.1,"MVP-1,DPOY-9,CPOY-5,AS,NBA1"
1,2.0,Luka Dončić,23.0,DAL,PG,66.0,66.0,36.2,10.9,22.0,0.496,2.8,8.2,0.342,8.1,13.8,0.588,0.56,7.8,10.5,0.742,0.8,7.8,8.6,8.0,1.4,0.5,3.6,2.5,32.4,"MVP-8,CPOY-8,AS,NBA1"
2,3.0,Damian Lillard,32.0,POR,PG,58.0,58.0,36.3,9.6,20.7,0.463,4.2,11.3,0.371,5.4,9.4,0.574,0.564,8.8,9.6,0.914,0.8,4.0,4.8,7.3,0.9,0.3,3.3,1.9,32.2,"CPOY-10,AS,NBA3"
3,4.0,Shai Gilgeous-Alexander,24.0,OKC,PG,68.0,68.0,35.5,10.4,20.3,0.51,0.9,2.5,0.345,9.5,17.8,0.533,0.531,9.8,10.9,0.905,0.9,4.0,4.8,5.5,1.6,1.0,2.8,2.8,31.4,"MVP-5,CPOY-7,AS,NBA1"
4,5.0,Giannis Antetokounmpo,28.0,MIL,PF,63.0,63.0,32.1,11.2,20.3,0.553,0.7,2.7,0.275,10.5,17.6,0.596,0.572,7.9,12.3,0.645,2.2,9.6,11.8,5.7,0.8,0.8,3.9,3.1,31.1,"MVP-3,DPOY-6,AS,NBA1"


### Data Cleaning

In [453]:
# Check for missing values
df_2022.isnull().sum()

Rk          1
Player      0
Age         1
Team        1
Pos         1
G           1
GS          1
MP          1
FG          1
FGA         1
FG%         3
3P          1
3PA         1
3P%        24
2P          1
2PA         1
2P%         7
eFG%        3
FT          1
FTA         1
FT%        37
ORB         1
DRB         1
TRB         1
AST         1
STL         1
BLK         1
TOV         1
PF          1
PTS         1
Awards    630
dtype: int64

In [454]:
# Fill missing values with 0
df_2022 = df_2022.fillna(0)
df_2022.isnull().sum()

Rk        0
Player    0
Age       0
Team      0
Pos       0
G         0
GS        0
MP        0
FG        0
FGA       0
FG%       0
3P        0
3PA       0
3P%       0
2P        0
2PA       0
2P%       0
eFG%      0
FT        0
FTA       0
FT%       0
ORB       0
DRB       0
TRB       0
AST       0
STL       0
BLK       0
TOV       0
PF        0
PTS       0
Awards    0
dtype: int64

In [455]:
# Drop columns that are not required
df_2022 = df_2022.drop(columns= ['GS', 'MP', 'Rk', 'Awards', 'PF', '2P', '2PA', 'FT', 'FTA', 'eFG%'])
# Remove repeated header rows (.copy() avoids a SettingWithCopyWarning when adding the Season column)
df_2022 = df_2022[df_2022["Player"] != "Player"].copy()

# Add Season column
df_2022["Season"] = "2022-23"

df_2022.head()

Unnamed: 0,Player,Age,Team,Pos,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PTS,Season
0,Joel Embiid,28.0,PHI,C,66.0,11.0,20.1,0.548,1.0,3.0,0.33,0.587,0.857,1.7,8.4,10.2,4.2,1.0,1.7,3.4,33.1,2022-23
1,Luka Dončić,23.0,DAL,PG,66.0,10.9,22.0,0.496,2.8,8.2,0.342,0.588,0.742,0.8,7.8,8.6,8.0,1.4,0.5,3.6,32.4,2022-23
2,Damian Lillard,32.0,POR,PG,58.0,9.6,20.7,0.463,4.2,11.3,0.371,0.574,0.914,0.8,4.0,4.8,7.3,0.9,0.3,3.3,32.2,2022-23
3,Shai Gilgeous-Alexander,24.0,OKC,PG,68.0,10.4,20.3,0.51,0.9,2.5,0.345,0.533,0.905,0.9,4.0,4.8,5.5,1.6,1.0,2.8,31.4,2022-23
4,Giannis Antetokounmpo,28.0,MIL,PF,63.0,11.2,20.3,0.553,0.7,2.7,0.275,0.596,0.645,2.2,9.6,11.8,5.7,0.8,0.8,3.9,31.1,2022-23


### Write to CSV File

In [456]:
df_2022.to_csv('nba2022.csv', index = False)

In [457]:
df_2022 = pd.read_csv('nba2022.csv')
#Display all columns
pd.set_option('display.max_columns', None)
df_2022.head()

Unnamed: 0,Player,Age,Team,Pos,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PTS,Season
0,Joel Embiid,28.0,PHI,C,66.0,11.0,20.1,0.548,1.0,3.0,0.33,0.587,0.857,1.7,8.4,10.2,4.2,1.0,1.7,3.4,33.1,2022-23
1,Luka Dončić,23.0,DAL,PG,66.0,10.9,22.0,0.496,2.8,8.2,0.342,0.588,0.742,0.8,7.8,8.6,8.0,1.4,0.5,3.6,32.4,2022-23
2,Damian Lillard,32.0,POR,PG,58.0,9.6,20.7,0.463,4.2,11.3,0.371,0.574,0.914,0.8,4.0,4.8,7.3,0.9,0.3,3.3,32.2,2022-23
3,Shai Gilgeous-Alexander,24.0,OKC,PG,68.0,10.4,20.3,0.51,0.9,2.5,0.345,0.533,0.905,0.9,4.0,4.8,5.5,1.6,1.0,2.8,31.4,2022-23
4,Giannis Antetokounmpo,28.0,MIL,PF,63.0,11.2,20.3,0.553,0.7,2.7,0.275,0.596,0.645,2.2,9.6,11.8,5.7,0.8,0.8,3.9,31.1,2022-23


## 2023-2024

In [458]:
#Retrieve HTML table dataset
url_2023 = 'https://www.basketball-reference.com/leagues/NBA_2024_per_game.html'
html = pd.read_html(url_2023, header = 0)
df_2023 = html[0]

df_2023.head()

Unnamed: 0,Rk,Player,Age,Team,Pos,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,eFG%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Awards
0,1.0,Joel Embiid,29.0,PHI,C,39.0,39.0,33.6,11.5,21.8,0.529,1.4,3.6,0.388,10.2,18.3,0.556,0.561,10.2,11.6,0.883,2.4,8.6,11.0,5.6,1.2,1.7,3.8,2.9,34.7,AS
1,2.0,Luka Dončić,24.0,DAL,PG,70.0,70.0,37.5,11.5,23.6,0.487,4.1,10.6,0.382,7.4,13.0,0.573,0.573,6.8,8.7,0.786,0.8,8.4,9.2,9.8,1.4,0.5,4.0,2.1,33.9,"MVP-3,CPOY-6,AS,NBA1"
2,3.0,Giannis Antetokounmpo,29.0,MIL,PF,73.0,73.0,35.2,11.5,18.8,0.611,0.5,1.7,0.274,11.0,17.1,0.645,0.624,7.0,10.7,0.657,2.7,8.8,11.5,6.5,1.2,1.1,3.4,2.9,30.4,"MVP-4,DPOY-9,CPOY-12,AS,NBA1"
3,4.0,Shai Gilgeous-Alexander,25.0,OKC,PG,75.0,75.0,34.0,10.6,19.8,0.535,1.3,3.6,0.353,9.3,16.2,0.576,0.567,7.6,8.7,0.874,0.9,4.7,5.5,6.2,2.0,0.9,2.2,2.5,30.1,"MVP-2,DPOY-7,CPOY-3,AS,NBA1"
4,5.0,Jalen Brunson,27.0,NYK,PG,77.0,77.0,35.4,10.3,21.4,0.479,2.7,6.8,0.401,7.5,14.6,0.516,0.543,5.5,6.5,0.847,0.6,3.1,3.6,6.7,0.9,0.2,2.4,1.9,28.7,"MVP-5,CPOY-5,AS,NBA2"


### Data Cleaning

In [459]:
# Check for missing values
df_2023.isnull().sum()

Rk          1
Player      0
Age         1
Team        1
Pos         1
G           1
GS          1
MP          1
FG          1
FGA         1
FG%         8
3P          1
3PA         1
3P%        46
2P          1
2PA         1
2P%        13
eFG%        8
FT          1
FTA         1
FT%        59
ORB         1
DRB         1
TRB         1
AST         1
STL         1
BLK         1
TOV         1
PF          1
PTS         1
Awards    681
dtype: int64

In [460]:
# Fill missing values with 0
df_2023 = df_2023.fillna(0)
df_2023.isnull().sum()

Rk        0
Player    0
Age       0
Team      0
Pos       0
G         0
GS        0
MP        0
FG        0
FGA       0
FG%       0
3P        0
3PA       0
3P%       0
2P        0
2PA       0
2P%       0
eFG%      0
FT        0
FTA       0
FT%       0
ORB       0
DRB       0
TRB       0
AST       0
STL       0
BLK       0
TOV       0
PF        0
PTS       0
Awards    0
dtype: int64

In [461]:
# Drop columns that are not required
df_2023 = df_2023.drop(columns= ['GS', 'MP', 'Rk', 'Awards', 'PF', '2P', '2PA', 'FT', 'FTA', 'eFG%'])
# Remove repeated header rows (.copy() avoids a SettingWithCopyWarning when adding the Season column)
df_2023 = df_2023[df_2023["Player"] != "Player"].copy()

# Add Season column
df_2023["Season"] = "2023-24"

df_2023.head()

Unnamed: 0,Player,Age,Team,Pos,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PTS,Season
0,Joel Embiid,29.0,PHI,C,39.0,11.5,21.8,0.529,1.4,3.6,0.388,0.556,0.883,2.4,8.6,11.0,5.6,1.2,1.7,3.8,34.7,2023-24
1,Luka Dončić,24.0,DAL,PG,70.0,11.5,23.6,0.487,4.1,10.6,0.382,0.573,0.786,0.8,8.4,9.2,9.8,1.4,0.5,4.0,33.9,2023-24
2,Giannis Antetokounmpo,29.0,MIL,PF,73.0,11.5,18.8,0.611,0.5,1.7,0.274,0.645,0.657,2.7,8.8,11.5,6.5,1.2,1.1,3.4,30.4,2023-24
3,Shai Gilgeous-Alexander,25.0,OKC,PG,75.0,10.6,19.8,0.535,1.3,3.6,0.353,0.576,0.874,0.9,4.7,5.5,6.2,2.0,0.9,2.2,30.1,2023-24
4,Jalen Brunson,27.0,NYK,PG,77.0,10.3,21.4,0.479,2.7,6.8,0.401,0.516,0.847,0.6,3.1,3.6,6.7,0.9,0.2,2.4,28.7,2023-24


### Write to CSV File

In [462]:
df_2023.to_csv('nba2023.csv', index = False)

In [463]:
df_2023 = pd.read_csv('nba2023.csv')
#Display all columns
pd.set_option('display.max_columns', None)
df_2023.head()

Unnamed: 0,Player,Age,Team,Pos,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PTS,Season
0,Joel Embiid,29.0,PHI,C,39.0,11.5,21.8,0.529,1.4,3.6,0.388,0.556,0.883,2.4,8.6,11.0,5.6,1.2,1.7,3.8,34.7,2023-24
1,Luka Dončić,24.0,DAL,PG,70.0,11.5,23.6,0.487,4.1,10.6,0.382,0.573,0.786,0.8,8.4,9.2,9.8,1.4,0.5,4.0,33.9,2023-24
2,Giannis Antetokounmpo,29.0,MIL,PF,73.0,11.5,18.8,0.611,0.5,1.7,0.274,0.645,0.657,2.7,8.8,11.5,6.5,1.2,1.1,3.4,30.4,2023-24
3,Shai Gilgeous-Alexander,25.0,OKC,PG,75.0,10.6,19.8,0.535,1.3,3.6,0.353,0.576,0.874,0.9,4.7,5.5,6.2,2.0,0.9,2.2,30.1,2023-24
4,Jalen Brunson,27.0,NYK,PG,77.0,10.3,21.4,0.479,2.7,6.8,0.401,0.516,0.847,0.6,3.1,3.6,6.7,0.9,0.2,2.4,28.7,2023-24


# EDA

## Aggregate Data

In [464]:
# Combine all datasets
combined_df = pd.concat([df_2020, df_2021, df_2022, df_2023], ignore_index=True)

# Group by each unique player while handling non-numeric columns
grouped_df = combined_df.groupby('Player').agg({
    'Age': 'mean',
    'G': 'mean',
    'FG': 'mean',
    'FGA': 'mean',
    'FG%': 'mean',
    '3P': 'mean',
    '3PA': 'mean',
    '3P%': 'mean',
    '2P%': 'mean',
    'FT%': 'mean',
    'ORB': 'mean',
    'DRB': 'mean',
    'TRB': 'mean',
    'AST': 'mean',
    'STL': 'mean',
    'BLK': 'mean',
    'TOV': 'mean',
    'PTS': 'mean',
    'Team': lambda x: ', '.join(set(x.dropna())),  # Join all teams a player played for
    'Pos': lambda x: ', '.join(set(x.dropna()))  # Join all positions a player played
}).reset_index()

# Create a copy to avoid modifying grouped_df in-place
average_stats = grouped_df.copy()

# Apply decimal formatting only to numeric columns
average_stats = average_stats.assign(
    PTS = average_stats['PTS'].round(1),
    Age = average_stats['Age'].round(0),
    G = average_stats['G'].round(0),
    FG = average_stats['FG'].round(1),
    FGA = average_stats['FGA'].round(1),
    FGP = average_stats['FG%'].round(3),
    TP = average_stats['3P'].round(1),
    TPA = average_stats['3PA'].round(1),
    TPP = average_stats['3P%'].round(3),
    TWPP = average_stats['2P%'].round(3),
    FT = average_stats['FT%'].round(3),
    ORB = average_stats['ORB'].round(1),
    DRB = average_stats['DRB'].round(1),
    TRB = average_stats['TRB'].round(1),
    AST = average_stats['AST'].round(1),
    STL = average_stats['STL'].round(1),
    BLK = average_stats['BLK'].round(1),
    TOV = average_stats['TOV'].round(1)
)

# Drop unnecessary columns
average_stats = average_stats.drop(columns=['FG%', '3P%', 'FT%', '3P', '3PA', '2P%'])

# Reorder columns
column_order = ['Player', 'Team', 'Pos', 'Age', 'G', 'FG', 'FGA', 'FG%', '3P', '3PA', '3P%', '2P%', 'FT%', 
                'ORB', 'DRB', 'REB', 'AST', 'STL', 'BLK', 'TOV', 'PTS']

average_stats.rename(columns={'FGP': 'FG%', 'TP': '3P', 'TPA': '3PA', 'TPP': '3P%', 'TWPP': '2P%', 'FT': 'FT%', 'TRB': 'REB'}, inplace=True)

# Ensure we only select columns that exist in the DataFrame
average_stats = average_stats[[col for col in column_order if col in average_stats.columns]]

# Display first rows
average_stats.head()


Unnamed: 0,Player,Team,Pos,Age,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,REB,AST,STL,BLK,TOV,PTS
0,A.J. Green,MIL,SG,24.0,46.0,1.5,3.6,0.424,1.2,3.0,0.414,0.485,0.948,0.2,1.0,1.2,0.6,0.2,0.0,0.2,4.4
1,A.J. Lawson,"DAL, 2TM, MIN",SG,22.0,18.0,1.3,2.5,0.608,0.4,1.2,0.265,0.705,0.288,0.3,1.0,1.2,0.2,0.1,0.0,0.2,3.2
2,AJ Griffin,ATL,SF,20.0,46.0,2.2,5.2,0.378,1.0,2.8,0.323,0.442,0.947,0.3,1.2,1.5,0.6,0.4,0.2,0.5,5.6
3,Aaron Gordon,"ORL, 2TM, DEN",PF,26.0,53.0,5.2,10.3,0.507,1.0,3.1,0.325,0.578,0.666,1.8,4.1,6.0,3.1,0.7,0.7,1.7,13.7
4,Aaron Henry,PHI,SF,22.0,6.0,0.2,0.8,0.2,0.0,0.2,0.0,0.25,0.0,0.0,0.2,0.2,0.0,0.0,0.3,0.3,0.3
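
The `Team` values above include entries such as `2TM`, which suggests that a traded player appears once per team plus once as a combined row within the same season, so a plain `mean` mixes all of those rows. One possible adjustment, sketched here on made-up numbers, keeps only the combined row for a player-season when one exists. The helper name `keep_season_totals` and the pattern used to detect combined rows (`2TM`, `3TM`, or the older `TOT` label) are assumptions, not part of the original notebook.

```python
import pandas as pd


def keep_season_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one row per player-season: the combined row (e.g. '2TM') if present."""
    # Assumed labels for combined rows: '2TM', '3TM', ... or 'TOT'
    is_combined = df['Team'].str.fullmatch(r'\d+TM|TOT', na=False)
    # A player-season has a combined row if any of its rows is flagged
    has_combined = is_combined.groupby([df['Player'], df['Season']]).transform('any')
    # Keep combined rows, plus every row of player-seasons without one
    return df[is_combined | ~has_combined].reset_index(drop=True)


# Made-up example: Player A was traded mid-season, Player B was not.
toy = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player A', 'Player B'],
    'Season': ['2020-21'] * 4,
    'Team': ['2TM', 'ORL', 'DEN', 'MIL'],
    'PTS': [12.4, 14.6, 10.2, 28.1],
})
totals = keep_season_totals(toy)
print(totals)
```

Applied to `combined_df` before the `groupby`, this would give each player one row per season, at the cost of losing the list of individual teams in the `Team` column.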


In [465]:
# Save the dataset to a CSV file
average_stats.to_csv('average_stats.csv', index = False)
average_stats = pd.read_csv('average_stats.csv')

__Who are the Top 15 scorers (PTS) across the 2020-21 to 2023-24 seasons?__

In [466]:
# Top 15 players with the highest average points per game
# Ascending sort + tail(15): the highest scorer ends up in the last row
top15_scorers = average_stats.dropna(subset=['PTS']).sort_values(by='PTS', ascending=True).tail(15)
top15_scorers

Unnamed: 0,Player,Team,Pos,Age,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,REB,AST,STL,BLK,TOV,PTS
884,Zion Williamson,NOP,PF,22.0,53.0,9.7,16.3,0.596,0.2,0.5,0.332,0.605,0.705,2.1,4.5,6.7,4.4,1.0,0.6,3.0,25.3
662,Nikola Jokić,DEN,C,26.0,74.0,10.1,17.1,0.591,1.1,3.1,0.367,0.64,0.829,2.7,9.5,12.2,8.8,1.4,0.8,3.4,26.1
814,Trae Young,ATL,PG,24.0,66.0,8.3,18.9,0.439,2.7,7.3,0.358,0.49,0.883,0.6,2.7,3.4,10.0,1.0,0.2,4.2,26.4
228,Donovan Mitchell,"CLE, UTA","PG, SG",26.0,61.0,9.3,20.4,0.458,3.4,9.2,0.374,0.528,0.857,0.8,3.6,4.5,5.2,1.4,0.4,2.8,26.8
545,Kyrie Irving,"DAL, 2TM, BRK","PG, SG",30.0,44.0,9.9,20.1,0.494,3.1,7.8,0.396,0.556,0.913,0.9,4.0,4.9,5.6,1.2,0.7,2.2,26.8
215,Devin Booker,PHO,"PG, SG",26.0,64.0,9.6,19.8,0.484,2.2,6.2,0.36,0.539,0.869,0.7,3.9,4.6,5.4,1.0,0.3,2.7,26.8
165,Damian Lillard,"POR, MIL",PG,32.0,57.0,8.4,19.3,0.435,3.6,10.0,0.36,0.517,0.91,0.6,3.8,4.4,7.3,0.8,0.3,3.0,27.3
763,Shai Gilgeous-Alexander,OKC,"PG, SG",24.0,58.0,9.4,18.8,0.502,1.5,4.1,0.354,0.542,0.849,0.8,4.3,5.0,5.9,1.4,0.8,2.7,27.4
554,LeBron James,LAL,"PG, C, PF",38.0,57.0,10.4,20.0,0.519,2.4,6.6,0.364,0.596,0.743,1.0,6.9,7.9,7.3,1.2,0.7,3.5,27.5
413,Jayson Tatum,BOS,"PF, SF",24.0,72.0,9.4,20.4,0.462,3.0,8.4,0.366,0.532,0.852,1.0,7.1,8.1,4.6,1.1,0.6,2.8,27.6


__Who are the Top 15 rebounders across the 2020-21 to 2023-24 seasons?__

In [467]:
# Top 15 rebounders with the highest average rebounds per game
top15_rebounders = average_stats.dropna(subset=['REB']).sort_values(by='REB', ascending=False).head(15)
top15_rebounders

Unnamed: 0,Player,Team,Pos,Age,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,REB,AST,STL,BLK,TOV,PTS
738,Rudy Gobert,"MIN, UTA",C,30.0,71.0,5.4,8.0,0.677,0.0,0.0,0.0,0.681,0.649,3.6,9.6,13.2,1.2,0.7,2.1,1.7,14.3
226,Domantas Sabonis,"2TM, SAC, IND","PF, C",25.0,58.0,7.4,12.9,0.575,0.5,1.7,0.324,0.613,0.734,3.1,9.3,12.4,6.4,1.0,0.5,3.2,19.2
662,Nikola Jokić,DEN,C,26.0,74.0,10.1,17.1,0.591,1.1,3.1,0.367,0.64,0.829,2.7,9.5,12.2,8.8,1.4,0.8,3.4,26.1
139,Clint Capela,ATL,C,28.0,69.0,5.4,9.0,0.608,0.0,0.0,0.0,0.608,0.57,4.3,7.7,12.0,1.0,0.7,1.5,0.9,12.4
291,Giannis Antetokounmpo,MIL,PF,28.0,66.0,10.8,18.9,0.572,0.8,2.9,0.286,0.623,0.677,2.1,9.4,11.5,6.0,1.1,1.1,3.5,29.9
664,Nikola Vučević,"ORL, 2TM, CHI",C,31.0,62.0,8.5,17.5,0.484,1.9,5.2,0.358,0.536,0.826,2.2,9.1,11.2,3.5,0.9,0.8,1.8,20.4
436,Joel Embiid,PHI,C,28.0,56.0,10.3,19.8,0.522,1.2,3.3,0.366,0.553,0.853,2.1,8.8,10.9,4.2,1.1,1.6,3.4,31.7
42,Anthony Davis,LAL,"PF, C",28.0,52.0,9.2,17.1,0.536,0.4,1.8,0.244,0.57,0.763,2.8,8.0,10.7,3.1,1.2,2.0,2.1,23.9
444,Jonas Valančiūnas,"NOP, MEM",C,30.0,74.0,6.1,11.0,0.56,0.5,1.5,0.346,0.594,0.801,3.1,7.6,10.7,2.1,0.5,0.8,1.8,15.3
855,Victor Wembanyama,SAS,C,20.0,71.0,7.8,16.7,0.465,1.8,5.5,0.325,0.534,0.796,2.3,8.4,10.6,3.9,1.2,3.6,3.7,21.4


__Who are the Top 30 players with the highest 3P%? (More than 5 three-point attempts per game)__

In [None]:
# Top 30 players with the highest 3P% while attempting more than 5 threes a game.
top30_3pt = average_stats[average_stats['3PA'] > 5].sort_values(by='3P%', ascending=False).head(30)
top30_3pt

Unnamed: 0,Player,Team,Pos,Age,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,REB,AST,STL,BLK,TOV,PTS
517,Kevin Durant,"2TM, BRK, PHO",PF,34.0,43.0,10.0,18.3,0.545,2.2,5.2,0.427,0.59,0.889,0.4,6.4,6.8,5.1,0.7,1.3,3.2,28.1
667,Norman Powell,"POR, 2TM, TOR, LAC","SG, SF",28.0,46.0,6.0,12.7,0.476,2.3,5.5,0.426,0.516,0.841,0.5,2.5,3.0,1.9,0.9,0.4,1.5,18.2
300,Grayson Allen,"MEM, PHO, MIL",SG,26.0,66.0,3.8,8.4,0.451,2.3,5.6,0.415,0.523,0.879,0.6,2.8,3.4,2.2,0.8,0.3,1.0,11.4
214,Desmond Bane,MEM,SG,24.0,61.0,6.6,14.1,0.468,2.7,6.6,0.414,0.517,0.868,0.7,3.6,4.2,3.6,1.0,0.4,1.8,18.2
378,Jamal Murray,DEN,PG,25.0,57.0,7.7,16.4,0.471,2.6,6.3,0.41,0.509,0.852,0.7,3.3,4.0,5.8,1.1,0.4,2.2,20.8
776,Stephen Curry,GSW,PG,34.0,64.0,9.4,20.1,0.466,4.9,11.9,0.409,0.548,0.919,0.6,4.8,5.3,5.9,1.0,0.3,3.2,28.3
861,Wayne Ellington,"DET, LAL",SG,34.0,44.0,2.8,6.4,0.428,2.2,5.4,0.406,0.57,0.809,0.2,1.6,1.8,1.1,0.4,0.2,0.6,8.2
622,Mike Conley,"2TM, MIN, UTA",PG,35.0,56.0,4.4,9.9,0.439,2.3,5.6,0.405,0.482,0.845,0.6,2.4,2.9,6.1,1.2,0.2,1.6,13.0
472,Jrue Holiday,"BOS, MIL",PG,32.0,66.0,6.6,13.4,0.491,2.1,5.1,0.404,0.544,0.81,1.2,3.7,4.9,6.3,1.3,0.6,2.4,17.0
844,Tyrese Haliburton,"2TM, SAC, IND","PG, SG",21.0,56.0,6.1,12.8,0.478,2.4,5.9,0.403,0.545,0.852,0.7,3.1,3.8,8.6,1.6,0.6,2.4,16.8


__Who are the Top 15 playmakers (AST) across the 2020-21 to 2023-24 seasons?__

In [None]:
# Top 15 playmakers with the highest average assists per game
top15_playmakers = average_stats.dropna(subset=['AST']).sort_values(by='AST', ascending=False).head(15)
top15_playmakers

Unnamed: 0,Player,Team,Pos,Age,G,FG,FGA,FG%,3P,3PA,3P%,2P%,FT%,ORB,DRB,REB,AST,STL,BLK,TOV,PTS
382,James Harden,"LAC, BRK, HOU, 2TM, PHI",PG,32.0,44.0,6.6,15.1,0.434,2.6,7.3,0.354,0.51,0.873,0.7,6.2,6.9,10.3,1.2,0.6,3.9,22.1
814,Trae Young,ATL,PG,24.0,66.0,8.3,18.9,0.439,2.7,7.3,0.358,0.49,0.883,0.6,2.7,3.4,10.0,1.0,0.2,4.2,26.4
567,Luka Dončić,DAL,PG,22.0,67.0,10.5,21.9,0.48,3.2,9.0,0.357,0.564,0.75,0.8,7.9,8.7,8.8,1.2,0.5,4.1,30.6
662,Nikola Jokić,DEN,C,26.0,74.0,10.1,17.1,0.591,1.1,3.1,0.367,0.64,0.829,2.7,9.5,12.2,8.8,1.4,0.8,3.4,26.1
133,Chris Paul,"GSW, PHO",PG,36.0,63.0,5.1,10.8,0.468,1.4,3.7,0.364,0.52,0.857,0.4,3.9,4.3,8.8,1.5,0.3,2.0,13.6
844,Tyrese Haliburton,"2TM, SAC, IND","PG, SG",21.0,56.0,6.1,12.8,0.478,2.4,5.9,0.403,0.545,0.852,0.7,3.1,3.8,8.6,1.6,0.6,2.4,16.8
740,Russell Westbrook,"WAS, LAC, 2TM, LAL",PG,34.0,60.0,6.3,14.1,0.446,1.1,3.6,0.308,0.494,0.663,1.4,5.4,6.8,7.6,1.1,0.4,3.5,16.6
338,Ja Morant,MEM,PG,22.0,48.0,8.8,18.6,0.47,1.4,4.7,0.307,0.527,0.762,1.0,4.3,5.3,7.6,1.0,0.4,3.2,24.4
547,LaMelo Ball,CHO,PG,20.0,46.0,7.4,17.3,0.427,3.0,8.0,0.368,0.475,0.833,1.3,4.8,6.0,7.5,1.6,0.3,3.4,20.8
165,Damian Lillard,"POR, MIL",PG,32.0,57.0,8.4,19.3,0.435,3.6,10.0,0.36,0.517,0.91,0.6,3.8,4.4,7.3,0.8,0.3,3.0,27.3


### Data Visuals

In [469]:
# Histogram of Points Per Game (PTS)
fig1 = px.histogram(average_stats, x='PTS', nbins=20, 
                   title='Distribution of Points Per Game (PTS)',
                   labels={'PTS': 'Points Per Game'},
                   opacity=0.8, color_discrete_sequence=['blue'])
fig1.show()

The histogram "Distribution of Points Per Game (PTS)" shows how scoring averages are spread across the whole league. It is skewed to the right: players are concentrated around the mean of about 5 PTS, while far fewer average 15 PTS or more.
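
A reading of skew from a histogram can be cross-checked numerically. The sketch below uses made-up values rather than the real dataset; `describe_skew` is a hypothetical helper name. In a right-skewed sample the mean sits above the median and the skewness statistic is positive.

```python
import pandas as pd


def describe_skew(values: pd.Series) -> dict:
    """Return the mean, median, and sample skewness of a numeric Series."""
    return {
        'mean': values.mean(),
        'median': values.median(),
        'skew': values.skew(),
    }


# Made-up points-per-game values with one high scorer pulling the tail right.
toy_pts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 30.0])
summary = describe_skew(toy_pts)
print(summary)
```

In the notebook the same check would be `describe_skew(average_stats['PTS'])`.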

In [470]:
# Histogram of PTS by Position
# Filter just PTS and Pos columns
pts_pos = average_stats[['PTS', 'Pos']]

# Filter out players with multiple positions
position = ['PG', 'SG', 'SF', 'PF', 'C']
pts_pos = pts_pos[pts_pos['Pos'].isin(position)]

# Create histogram faceted by position
fig2 = px.histogram(pts_pos, x='PTS', facet_col='Pos', 
                   title='Points Per Game Distribution by Position',
                   labels={'PTS': 'Points Per Game', 'Pos': 'Position'},
                   opacity=1, nbins=20)

fig2.show()

The figure above shows five histograms, one per position (players listed at more than one position are excluded). Each histogram shows how many players at that position average a given number of points. All five positions have roughly the same mean, at about the 5 PTS mark.
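
The per-position comparison can also be summarized in a small table. This is a sketch on made-up rows, not the real data; it applies the same single-position filter as the cell above and then computes the count, mean, and median of `PTS` for each position.

```python
import pandas as pd

# Made-up rows; 'SF, PF' mimics a player listed at more than one position.
toy = pd.DataFrame({
    'Pos': ['PG', 'PG', 'C', 'C', 'SF, PF'],
    'PTS': [10.0, 20.0, 6.0, 8.0, 12.0],
})

# Same filter as the notebook: keep players with exactly one listed position.
single_pos = toy[toy['Pos'].isin(['PG', 'SG', 'SF', 'PF', 'C'])]

# One row per position with the number of players and their typical scoring.
by_pos = single_pos.groupby('Pos')['PTS'].agg(['count', 'mean', 'median'])
print(by_pos)
```

Replacing `toy` with `average_stats` would give the actual per-position means behind the histograms.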

In [471]:
# Horizontal bar chart of the top 15 players with the highest points per game
fig3 = px.bar(top15_scorers, x='PTS', y='Player', 
              title='Top 15 Players with the Highest Points Per Game',
              labels={'PTS': 'Points Per Game', 'Player': 'Player Name'},
              color='PTS', orientation = 'h')

fig3.show()

The bar chart above shows the 15 players with the highest points per game. The top scorer across 2020-2024 is Joel Embiid with 31.7 PTS.

In [472]:
# Scatter plot of Points Per Game vs. Field Goal Percentage of Top 15 Scorers
fig4 = px.scatter(top15_scorers, x='PTS', y='FG%', 
                 title='Points Per Game vs. Field Goal Percentage',
                 labels={'PTS': 'Points Per Game', 'FG%': 'Field Goal Percentage'},
                 size='PTS', hover_name='Player')

fig4.show()

The scatter plot compares points per game with field goal percentage to show the efficiency of the top 15 scorers. Although Joel Embiid is the top scorer, he is not the most efficient. The most efficient scorers are Zion Williamson (25.3 PTS on 59.6% shooting), Nikola Jokić (26.1 PTS on 59.1% shooting), and Giannis Antetokounmpo (29.9 PTS on 57.2% shooting).

In [473]:
# Scatter plot of Defensive Rebounds vs. Offensive Rebounds of Top 15 Rebounders
fig5 = px.scatter(top15_rebounders, x='DRB', y='ORB', 
                 title='Defensive Rebounds vs. Offensive Rebounds',
                 labels={'DRB': 'Defensive Rebounds', 'ORB': 'Offensive Rebounds'},
                 size='REB', hover_name='Player')

fig5.show()

The plot above shows the league's top rebounders, split into offensive and defensive rebounds. Rudy Gobert is the top rebounder, with a defensive-to-offensive ratio of about 3:1, which means most of his rebounds come on the defensive end of the floor. The top offensive rebounder is Steven Adams with about 4.5 offensive rebounds per game, although he ranks 14th in total rebounds.
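
The defensive-to-offensive split can be computed directly instead of read off the plot. The sketch below copies three rows from the rebounders table shown earlier; the column name `DRB_per_ORB` is an illustrative choice.

```python
import pandas as pd

# ORB and DRB values taken from the top rebounders table above.
rebounders = pd.DataFrame({
    'Player': ['Rudy Gobert', 'Nikola Jokić', 'Clint Capela'],
    'ORB': [3.6, 2.7, 4.3],
    'DRB': [9.6, 9.5, 7.7],
})

# Defensive rebounds collected per offensive rebound.
rebounders['DRB_per_ORB'] = (rebounders['DRB'] / rebounders['ORB']).round(2)
print(rebounders.sort_values('DRB_per_ORB'))
```

A lower value means a larger share of a player's rebounds comes on the offensive glass.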

In [474]:
# Scatter plot of 3PT Attempts vs. 3PT Percentage of Top 30 3PT Shooters
fig6 = px.scatter(top30_3pt, x='3PA', y='3P%', 
                 title='3PT Attempts vs. 3PT Percentage (Shooting More Than 5 3PTs Per Game)',
                 labels={'3PA': '3-Point Attempts Per Game', '3P%': '3-Point Percentage'},
                 size='3PA', hover_name='Player')

fig6.show()

This plot shows the efficiency of the top 30 three-point shooters (ranked by three-point percentage among players attempting more than 5 threes per game). Stephen Curry stands out as an outlier in volume: he attempts about 11.9 threes per game while shooting 40.9% from three. The most accurate shooter in the group is Kevin Durant at 42.7%, but on only about 5.2 attempts per game.

In [477]:
# Scatter plot of Assists vs. Turnovers (Playmaking Efficiency)
fig7 = px.scatter(top15_playmakers, x='AST', y='TOV', 
                 title='Assists vs. Turnovers (Playmaking Efficiency)',
                 labels={'AST': 'Assists Per Game', 'TOV': 'Turnovers Per Game'},
                 color='Pos', size='AST', hover_name='Player')

fig7.show()

This plot compares assists with turnovers to show how efficient each playmaker is. Fewer turnovers are better, but more assists tend to come with more turnovers. Typically, the assist-to-turnover ratio to aim for is 2:1. Anything lower usually means the player is not an elite playmaker or is not the designated playmaker for their team. The player with the best assist-to-turnover ratio is Chris Paul at about 4:1.
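
The assist-to-turnover ratio can be added as a column and ranked. The sketch below copies four rows from the playmakers table shown earlier; the column name `AST_TOV` is an illustrative choice.

```python
import pandas as pd

# AST and TOV values taken from the top playmakers table above.
playmakers = pd.DataFrame({
    'Player': ['James Harden', 'Trae Young', 'Chris Paul', 'Tyrese Haliburton'],
    'AST': [10.3, 10.0, 8.8, 8.6],
    'TOV': [3.9, 4.2, 2.0, 2.4],
})

# Assists per turnover; higher means more efficient playmaking.
playmakers['AST_TOV'] = (playmakers['AST'] / playmakers['TOV']).round(2)
ranked = playmakers.sort_values('AST_TOV', ascending=False)
best = ranked.iloc[0]['Player']
print(ranked)
```

The same two lines applied to `top15_playmakers` would rank all 15 players rather than this subset.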