# **Exploratory Data Analysis in Python using pandas**

In this Jupyter notebook, I show how to perform exploratory data analysis (EDA) on NBA player statistics scraped from the web.

## **Web scraping data using pandas**

The following block of code will retrieve the "2021-22 NBA Player Stats: Per Game" data from http://www.basketball-reference.com/.

In [4]:
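import pandas as pd

# read_html needs an HTML parser installed (lxml, or bs4 with html5lib).
# It returns a list with one DataFrame per <table> found on the page.
url = 'https://www.basketball-reference.com/leagues/NBA_2022_per_game.html'
html = pd.read_html(url, header=0)
df2022 = html[0]  # the per-game stats table is the first one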

In [5]:
df2022

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,...,.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
1,2,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,...,.543,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9
2,3,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,...,.753,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1
3,4,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,...,.625,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1
4,5,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,...,.873,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
837,601,Thaddeus Young,PF,33,TOR,26,0,18.3,2.6,5.5,...,.481,1.5,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3
838,602,Trae Young,PG,23,ATL,76,76,34.9,9.4,20.3,...,.904,0.7,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4
839,603,Omer Yurtseven,C,23,MIA,56,12,12.6,2.3,4.4,...,.623,1.5,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3
840,604,Cody Zeller,C,29,POR,27,0,13.1,1.9,3.3,...,.776,1.9,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2


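Check the "Age" column for values that are not ages. The source table repeats its header row at intervals, and those repeats are read in as ordinary data rows, so they need to be removed.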

In [6]:
df2022.Age.value_counts()

Age
24     104
23      81
22      73
26      68
25      65
28      60
27      53
29      50
21      48
31      42
30      33
20      31
Age     30
32      29
33      23
35      14
19      13
34      11
36       9
37       2
41       1
38       1
40       1
Name: count, dtype: int64

In [7]:
# Drop the repeated header rows (rows whose Age cell is the string "Age")
df = df2022.drop(df2022[df2022.Age == "Age"].index)
df.Age.value_counts()

Age
24    104
23     81
22     73
26     68
25     65
28     60
27     53
29     50
21     48
31     42
30     33
20     31
32     29
33     23
35     14
19     13
34     11
36      9
37      2
41      1
38      1
40      1
Name: count, dtype: int64
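
The same clean-up can be sketched on a tiny hand-built frame. The two player rows below are copied from the output above and a repeated header row is placed between them; the names `sample` and `clean` are illustrative only. Filtering keeps the original index labels, which is presumably why the full table's index still runs to 840 after the drop; `reset_index(drop=True)` renumbers the rows if that matters.

```python
import pandas as pd

# Three rows as they might arrive from read_html: every cell is text,
# and the middle row is a repeated copy of the header.
sample = pd.DataFrame(
    {
        "Player": ["Precious Achiuwa", "Player", "Steven Adams"],
        "Age": ["22", "Age", "28"],
        "PTS": ["9.1", "PTS", "6.9"],
    }
)

# Keep only rows whose Age cell is not the literal string "Age".
clean = sample[sample["Age"] != "Age"]
print(clean.index.tolist())  # the index labels keep their gap

# Renumber the rows and convert the text columns to numbers.
clean = clean.reset_index(drop=True)
clean = clean.assign(
    Age=pd.to_numeric(clean["Age"]),
    PTS=pd.to_numeric(clean["PTS"]),
)
print(clean.dtypes)
```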

## **Acronyms**


Acronym | Description
---|---
Rk | Rank
Pos | Position
Age | Player's age on February 1 of the season
Tm | Team
G | Games
GS | Games Started
MP | Minutes Played Per Game
FG | Field Goals Per Game
FGA | Field Goal Attempts Per Game
FG% | Field Goal Percentage
3P | 3-Point Field Goals Per Game
3PA | 3-Point Field Goal Attempts Per Game
3P% | FG% on 3-Pt FGAs.
2P | 2-Point Field Goals Per Game
2PA | 2-Point Field Goal Attempts Per Game
2P% | FG% on 2-Pt FGAs.
eFG% | Effective Field Goal Percentage *(adjusts for the fact that a 3-point field goal is worth one more point than a 2-point field goal)*
FT | Free Throws Per Game
FTA | Free Throw Attempts Per Game
FT% | Free Throw Percentage
ORB | Offensive Rebounds Per Game
DRB | Defensive Rebounds Per Game
TRB | Total Rebounds Per Game
AST | Assists Per Game
STL | Steals Per Game
BLK | Blocks Per Game
TOV | Turnovers Per Game
PF | Personal Fouls Per Game
PTS | Points Per Game

## **Data cleaning**

### Data dimension

In [8]:
df.shape

(812, 30)

### Dataframe contents

In [9]:
df.head()

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,...,0.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
1,2,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,...,0.543,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9
2,3,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,...,0.753,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1
3,4,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,...,0.625,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1
4,5,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,...,0.873,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9


### Check for missing values

In [10]:
df.isnull().sum()

Rk         0
Player     0
Pos        0
Age        0
Tm         0
G          0
GS         0
MP         0
FG         0
FGA        0
FG%       15
3P         0
3PA        0
3P%       72
2P         0
2PA        0
2P%       28
eFG%      15
FT         0
FTA        0
FT%       97
ORB        0
DRB        0
TRB        0
AST        0
STL        0
BLK        0
TOV        0
PF         0
PTS        0
dtype: int64

### Replace missing values with 0 

In [11]:
raw = df.fillna(0)
raw

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,1,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,...,.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
1,2,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,...,.543,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9
2,3,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,...,.753,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1
3,4,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,...,.625,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1
4,5,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,...,.873,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
837,601,Thaddeus Young,PF,33,TOR,26,0,18.3,2.6,5.5,...,.481,1.5,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3
838,602,Trae Young,PG,23,ATL,76,76,34.9,9.4,20.3,...,.904,0.7,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4
839,603,Omer Yurtseven,C,23,MIA,56,12,12.6,2.3,4.4,...,.623,1.5,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3
840,604,Cody Zeller,C,29,POR,27,0,13.1,1.9,3.3,...,.776,1.9,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2


In [12]:
raw.isnull().sum()

Rk        0
Player    0
Pos       0
Age       0
Tm        0
G         0
GS        0
MP        0
FG        0
FGA       0
FG%       0
3P        0
3PA       0
3P%       0
2P        0
2PA       0
2P%       0
eFG%      0
FT        0
FTA       0
FT%       0
ORB       0
DRB       0
TRB       0
AST       0
STL       0
BLK       0
TOV       0
PF        0
PTS       0
dtype: int64
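
A small sketch of the same step on made-up rows (not taken from the scraped data), in which a player with zero attempts has no percentage. This is one likely reason why only the percentage columns have gaps, though the notebook does not confirm it. The column-restricted variant at the end is an alternative, not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: player B attempted no free throws,
# so FT% (FT / FTA) is undefined and shows up as NaN.
toy = pd.DataFrame(
    {
        "Player": ["A", "B", "C"],
        "FT": [1.5, 0.0, 2.0],
        "FTA": [2.0, 0.0, 2.5],
        "FT%": [0.75, np.nan, 0.8],
    }
)

missing = toy.isnull().sum()  # NaN count per column
filled = toy.fillna(0)        # the notebook's approach: NaN -> 0 everywhere

# Alternative: fill only the percentage columns, so that gaps
# in any other column would stay visible.
pct_cols = [c for c in toy.columns if c.endswith("%")]
filled_pct = toy.copy()
filled_pct[pct_cols] = filled_pct[pct_cols].fillna(0)

print(missing)
print(filled)
```

Filling with 0 treats "no attempts" as "0% made", which pulls percentage averages down; that is worth keeping in mind when reading summary statistics of those columns.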

In [13]:
# Rk is only the row rank from the source table, so drop it
raw = raw.drop(["Rk"], axis=1)
raw

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,.439,...,.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
1,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,.547,...,.543,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9
2,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,.557,...,.753,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1
3,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,.402,...,.625,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1
4,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,.550,...,.873,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
837,Thaddeus Young,PF,33,TOR,26,0,18.3,2.6,5.5,.465,...,.481,1.5,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3
838,Trae Young,PG,23,ATL,76,76,34.9,9.4,20.3,.460,...,.904,0.7,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4
839,Omer Yurtseven,C,23,MIA,56,12,12.6,2.3,4.4,.526,...,.623,1.5,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3
840,Cody Zeller,C,29,POR,27,0,13.1,1.9,3.3,.567,...,.776,1.9,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2


## **Exploratory Data Analysis**

#### Display the dataframe

In [14]:
raw

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,.439,...,.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
1,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,.547,...,.543,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9
2,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,.557,...,.753,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1
3,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,.402,...,.625,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1
4,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,.550,...,.873,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
837,Thaddeus Young,PF,33,TOR,26,0,18.3,2.6,5.5,.465,...,.481,1.5,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3
838,Trae Young,PG,23,ATL,76,76,34.9,9.4,20.3,.460,...,.904,0.7,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4
839,Omer Yurtseven,C,23,MIA,56,12,12.6,2.3,4.4,.526,...,.623,1.5,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3
840,Cody Zeller,C,29,POR,27,0,13.1,1.9,3.3,.567,...,.776,1.9,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2


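### Write the cleaned data to a CSV file

Because the repeated header rows were read in as data, the scraped columns are still stored as text. Writing the dataframe to a CSV file and reading it back lets pandas infer a numeric type for each column.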

In [15]:
raw.to_csv("Zia.csv", index = False)

### Show specific data types in the dataframe

Read the CSV file back in.

In [16]:
df = pd.read_csv("Zia.csv")
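
The round trip through a CSV file changes the column types because `read_csv` infers a type for each column while parsing the text. A minimal sketch of that effect, using an in-memory buffer (`io.StringIO`) instead of a file on disk; the two rows are copied from the table above and the variable names are illustrative. `pd.to_numeric` is shown as an alternative that needs no file at all.

```python
import io

import pandas as pd

# Text-typed columns, as left behind by the scrape.
text_df = pd.DataFrame(
    {
        "Player": ["Precious Achiuwa", "Steven Adams"],
        "Age": ["22", "28"],
        "PTS": ["9.1", "6.9"],
    }
)
print(text_df.dtypes)  # all three columns hold text

# Round trip: write CSV text to a buffer, then parse it again.
buffer = io.StringIO()
text_df.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip.dtypes)  # Age and PTS are now numeric

# Alternative without a file: convert the chosen columns directly.
converted = text_df.copy()
for col in ["Age", "PTS"]:
    converted[col] = pd.to_numeric(converted[col])
```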

In [17]:
df.select_dtypes(include=['number'])

Unnamed: 0,Age,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,22,73,28,23.6,3.6,8.3,0.439,0.8,2.1,0.359,...,0.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
1,28,76,75,26.3,2.8,5.1,0.547,0.0,0.0,0.000,...,0.543,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9
2,24,56,56,32.6,7.3,13.0,0.557,0.0,0.1,0.000,...,0.753,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1
3,21,32,0,11.3,1.7,4.1,0.402,0.2,1.5,0.125,...,0.625,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1
4,36,47,12,22.3,5.4,9.7,0.550,0.3,1.0,0.304,...,0.873,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
807,33,26,0,18.3,2.6,5.5,0.465,0.7,1.7,0.395,...,0.481,1.5,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3
808,23,76,76,34.9,9.4,20.3,0.460,3.1,8.0,0.382,...,0.904,0.7,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4
809,23,56,12,12.6,2.3,4.4,0.526,0.0,0.2,0.091,...,0.623,1.5,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3
810,29,27,0,13.1,1.9,3.3,0.567,0.0,0.1,0.000,...,0.776,1.9,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2


In [18]:
df.select_dtypes(include=['object'])

Unnamed: 0,Player,Pos,Tm
0,Precious Achiuwa,C,TOR
1,Steven Adams,C,MEM
2,Bam Adebayo,C,MIA
3,Santi Aldama,PF,MEM
4,LaMarcus Aldridge,C,BRK
...,...,...,...
807,Thaddeus Young,PF,TOR
808,Trae Young,PG,ATL
809,Omer Yurtseven,C,MIA
810,Cody Zeller,C,POR


Overview of the data type of each column in the dataframe:

In [19]:
df.dtypes

Player     object
Pos        object
Age         int64
Tm         object
G           int64
GS          int64
MP        float64
FG        float64
FGA       float64
FG%       float64
3P        float64
3PA       float64
3P%       float64
2P        float64
2PA       float64
2P%       float64
eFG%      float64
FT        float64
FTA       float64
FT%       float64
ORB       float64
DRB       float64
TRB       float64
AST       float64
STL       float64
BLK       float64
TOV       float64
PF        float64
PTS       float64
dtype: object

In [20]:
df

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,0.439,...,0.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
1,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,0.547,...,0.543,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9
2,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,0.557,...,0.753,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1
3,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,0.402,...,0.625,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1
4,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,0.550,...,0.873,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
807,Thaddeus Young,PF,33,TOR,26,0,18.3,2.6,5.5,0.465,...,0.481,1.5,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3
808,Trae Young,PG,23,ATL,76,76,34.9,9.4,20.3,0.460,...,0.904,0.7,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4
809,Omer Yurtseven,C,23,MIA,56,12,12.6,2.3,4.4,0.526,...,0.623,1.5,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3
810,Cody Zeller,C,29,POR,27,0,13.1,1.9,3.3,0.567,...,0.776,1.9,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2


## **QUESTIONS**

### **Conditional Selection**

In performing exploratory data analysis, it is important to be able to select subsets of data to perform analysis or comparisons.

**Which player scored the most Points (PTS) Per Game?**
Here, we will return the entire row.

In [21]:
ans = df[df.PTS == df.PTS.max()]
ans

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
206,Joel Embiid,C,27,PHI,68,68,33.8,9.8,19.6,0.499,...,0.814,2.1,9.6,11.7,4.2,1.1,1.5,3.1,2.7,30.6


**We will return specific column values.**

**Further question, what team is the player from?**

In [22]:
ans.Tm

206    PHI
Name: Tm, dtype: object

**Which position is the player playing as?**

In [23]:
ans.Pos

206    C
Name: Pos, dtype: object

**How many games did the player play in the season?**

In [24]:
ans.G

206    68
Name: G, dtype: int64
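
The single-column lookups above can also be written as one `.loc` call. A sketch on five rows copied from this notebook's outputs (the frame name `top5` is illustrative). `idxmax` returns the index label of the first maximum, so unlike the boolean comparison it yields a single row even when several players tie.

```python
import pandas as pd

top5 = pd.DataFrame(
    {
        "Player": ["Joel Embiid", "LeBron James", "Giannis Antetokounmpo",
                   "Trae Young", "Stephen Curry"],
        "Pos": ["C", "C", "PF", "PG", "PG"],
        "Tm": ["PHI", "LAL", "MIL", "ATL", "GSW"],
        "G": [68, 56, 67, 76, 64],
        "PTS": [30.6, 30.3, 29.9, 28.4, 25.5],
    }
)

# Label of the row with the highest PTS, then several columns at once.
best = top5.loc[top5["PTS"].idxmax(), ["Player", "Tm", "Pos", "G"]]
print(best)

# A single column name gives a scalar instead of a one-element Series.
team = top5.loc[top5["PTS"].idxmax(), "Tm"]
print(team)
```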

**Which players scored more than 20 Points (PTS) Per Game?**

In [25]:
player = df[df.PTS > 20]
player

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
15,Giannis Antetokounmpo,PF,27,MIL,67,67,32.9,10.3,18.6,0.553,...,0.722,2.0,9.6,11.6,5.8,1.1,1.4,3.3,3.2,29.9
32,LaMelo Ball,PG,20,CHO,75,75,32.3,7.2,16.7,0.429,...,0.872,1.4,5.2,6.7,7.6,1.6,0.4,3.3,3.2,20.1
48,Bradley Beal,SG,28,WAS,40,40,36.0,8.7,19.3,0.451,...,0.833,1.0,3.8,4.7,6.6,0.9,0.4,3.4,2.4,23.2
70,Devin Booker,SG,25,PHO,68,68,34.5,9.7,20.9,0.466,...,0.868,0.7,4.4,5.0,4.8,1.1,0.4,2.4,2.6,26.8
78,Miles Bridges,PF,23,CHO,80,80,35.5,7.5,15.2,0.491,...,0.802,1.1,5.9,7.0,3.8,0.9,0.8,1.9,2.4,20.2
93,Jaylen Brown,SF,25,BOS,66,66,33.6,8.7,18.4,0.473,...,0.758,0.8,5.3,6.1,3.5,1.1,0.3,2.7,2.5,23.6
106,Jimmy Butler,SF,32,MIA,57,57,33.9,7.0,14.5,0.48,...,0.87,1.8,4.1,5.9,5.5,1.6,0.5,2.1,1.5,21.4
159,Stephen Curry,PG,33,GSW,64,64,34.5,8.4,19.1,0.437,...,0.923,0.5,4.7,5.2,6.3,1.3,0.4,3.2,2.0,25.5
160,Anthony Davis,C,28,LAL,40,40,35.1,9.3,17.4,0.532,...,0.713,2.7,7.2,9.9,3.1,1.2,2.3,2.1,2.4,23.2
167,DeMar DeRozan,PF,32,CHI,76,76,36.1,10.2,20.2,0.504,...,0.877,0.7,4.4,5.2,4.9,0.9,0.3,2.4,2.3,27.9
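
Conditions can be combined with `&` (and) and `|` (or), with each condition wrapped in its own parentheses. The sketch below uses rows copied from the output above; the 25-and-under cut-off is an arbitrary choice for illustration, and `scorers` is an illustrative name.

```python
import pandas as pd

scorers = pd.DataFrame(
    {
        "Player": ["LaMelo Ball", "Bradley Beal", "Devin Booker",
                   "Miles Bridges", "Jimmy Butler", "Stephen Curry"],
        "Age": [20, 28, 25, 23, 32, 33],
        "PTS": [20.1, 23.2, 26.8, 20.2, 21.4, 25.5],
    }
)

# Players aged 25 or younger who average more than 20 points,
# highest scorer first.
young = scorers[(scorers["Age"] <= 25) & (scorers["PTS"] > 20)]
young = young.sort_values("PTS", ascending=False)
print(young)

# The same filter written with query().
young_q = scorers.query("Age <= 25 and PTS > 20")
```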


**Which player had the highest 3-Point Field Goals Per Game (3P)?**

In [26]:
df[df["3P"] == df["3P"].max()]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
159,Stephen Curry,PG,33,GSW,64,64,34.5,8.4,19.1,0.437,...,0.923,0.5,4.7,5.2,6.3,1.3,0.4,3.2,2.0,25.5


**Which player had the highest Assists Per Game (AST)?**

In [27]:
df[df.AST == df.AST.max()]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
582,Chris Paul,PG,36,PHO,65,65,32.9,5.6,11.3,0.493,...,0.837,0.3,4.0,4.4,10.8,1.9,0.3,2.4,2.1,14.7


### **GroupBy() function**

**Which Los Angeles Lakers player scored the most Points (PTS) Per Game?**

In [28]:
df[df.Tm == "LAL"]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
17,Carmelo Anthony,PF,37,LAL,69,3,26.0,4.6,10.5,0.441,...,0.83,0.9,3.3,4.2,1.0,0.7,0.8,0.9,2.4,13.3
21,Trevor Ariza,SF,36,LAL,24,11,19.3,1.4,4.1,0.333,...,0.556,0.4,3.0,3.4,1.1,0.5,0.3,0.5,0.8,4.0
24,D.J. Augustin,PG,34,LAL,21,0,17.8,1.9,4.1,0.453,...,1.0,0.2,1.1,1.3,1.6,0.3,0.0,0.5,1.0,5.3
46,Kent Bazemore,SF,32,LAL,39,14,14.0,1.2,3.6,0.324,...,0.765,0.3,1.4,1.8,0.9,0.6,0.2,0.5,1.8,3.4
74,Avery Bradley,SG,31,LAL,62,45,22.7,2.4,5.6,0.423,...,0.889,0.5,1.7,2.2,0.8,0.9,0.1,0.6,1.9,6.4
90,Chaundee Brown Jr.,SF,23,LAL,2,0,10.5,0.5,3.5,0.143,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.5,0.0,1.0
137,Darren Collison,PG,34,LAL,3,0,12.3,0.7,2.3,0.286,...,0.0,0.0,1.3,1.3,0.7,0.3,0.0,0.3,1.7,1.3
160,Anthony Davis,C,28,LAL,40,40,35.1,9.3,17.4,0.532,...,0.713,2.7,7.2,9.9,3.1,1.2,2.3,2.1,2.4,23.2
183,Sekou Doumbouya,PF,21,LAL,2,0,8.0,2.5,4.0,0.625,...,0.75,1.0,2.0,3.0,0.0,1.5,1.0,1.0,0.5,7.0
205,Wayne Ellington,SG,34,LAL,43,9,18.8,2.3,5.5,0.414,...,0.818,0.2,1.6,1.8,0.7,0.5,0.1,0.4,1.0,6.7


In [29]:
lakers = df[df.Tm == "LAL"]
lakers[lakers.PTS == lakers.PTS.max()]

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
368,LeBron James,C,37,LAL,56,56,37.2,11.4,21.8,0.524,...,0.756,1.1,7.1,8.2,6.2,1.3,1.1,3.5,2.2,30.3
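
The filter above answers the question for one team. With `groupby`, the same question can be asked of every team at once. A sketch on rows copied from this notebook's outputs; `team_rows` and `top_scorers` are illustrative names.

```python
import pandas as pd

team_rows = pd.DataFrame(
    {
        "Player": ["LeBron James", "Anthony Davis", "Carmelo Anthony",
                   "Devin Booker", "Chris Paul", "Joel Embiid"],
        "Tm": ["LAL", "LAL", "LAL", "PHO", "PHO", "PHI"],
        "PTS": [30.3, 23.2, 13.3, 26.8, 14.7, 30.6],
    }
)

# idxmax per group gives the row label of each team's top scorer;
# .loc then pulls those rows out of the original frame.
top_idx = team_rows.groupby("Tm")["PTS"].idxmax()
top_scorers = team_rows.loc[top_idx].set_index("Tm")
print(top_scorers)
```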


**Of the 5 positions, which position scores the most points?**

We first group players by their positions.

In [30]:
df.groupby('Pos').PTS.describe()

Pos,count,mean,std,min,25%,50%,75%,max
C,140.0,7.617143,5.703715,0.0,3.7,6.9,9.225,30.6
C-PF,1.0,7.0,,7.0,7.0,7.0,7.0,7.0
PF,150.0,8.146667,6.297381,0.0,3.325,6.7,10.575,29.9
PF-SF,1.0,3.1,,3.1,3.1,3.1,3.1,3.1
PG,155.0,8.474194,6.78208,0.0,3.4,6.3,11.85,28.4
PG-SG,1.0,8.3,,8.3,8.3,8.3,8.3,8.3
SF,153.0,6.75817,5.491688,0.0,2.8,5.8,9.1,26.9
SF-SG,4.0,10.05,6.499487,3.8,6.425,8.7,12.325,19.0
SG,200.0,8.036,6.29572,0.0,2.9,6.7,11.95,26.8
SG-PG,3.0,8.2,6.183041,4.0,4.65,5.3,10.3,15.3


We will now show only the 5 traditional positions (those having combo positions will be removed from the analysis).

In [38]:
positions = ['C', 'PF', 'SF', 'PG', 'SG']
POS = df[df['Pos'].isin(positions)]
POS

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Precious Achiuwa,C,22,TOR,73,28,23.6,3.6,8.3,0.439,...,0.595,2.0,4.5,6.5,1.1,0.5,0.6,1.2,2.1,9.1
1,Steven Adams,C,28,MEM,76,75,26.3,2.8,5.1,0.547,...,0.543,4.6,5.4,10.0,3.4,0.9,0.8,1.5,2.0,6.9
2,Bam Adebayo,C,24,MIA,56,56,32.6,7.3,13.0,0.557,...,0.753,2.4,7.6,10.1,3.4,1.4,0.8,2.6,3.1,19.1
3,Santi Aldama,PF,21,MEM,32,0,11.3,1.7,4.1,0.402,...,0.625,1.0,1.7,2.7,0.7,0.2,0.3,0.5,1.1,4.1
4,LaMarcus Aldridge,C,36,BRK,47,12,22.3,5.4,9.7,0.550,...,0.873,1.6,3.9,5.5,0.9,0.3,1.0,0.9,1.7,12.9
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
807,Thaddeus Young,PF,33,TOR,26,0,18.3,2.6,5.5,0.465,...,0.481,1.5,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3
808,Trae Young,PG,23,ATL,76,76,34.9,9.4,20.3,0.460,...,0.904,0.7,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4
809,Omer Yurtseven,C,23,MIA,56,12,12.6,2.3,4.4,0.526,...,0.623,1.5,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3
810,Cody Zeller,C,29,POR,27,0,13.1,1.9,3.3,0.567,...,0.776,1.9,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2


Now, let's take a look at the descriptive statistics.

In [39]:
POS.groupby('Pos').PTS.describe()

Pos,count,mean,std,min,25%,50%,75%,max
C,140.0,7.617143,5.703715,0.0,3.7,6.9,9.225,30.6
PF,150.0,8.146667,6.297381,0.0,3.325,6.7,10.575,29.9
PG,155.0,8.474194,6.78208,0.0,3.4,6.3,11.85,28.4
SF,153.0,6.75817,5.491688,0.0,2.8,5.8,9.1,26.9
SG,200.0,8.036,6.29572,0.0,2.9,6.7,11.95,26.8
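
Which position "scores the most" depends on the statistic chosen. The sketch below rebuilds the mean and median (50%) columns of the table above by hand and ranks the positions by each; `summary` is an illustrative name. On the full data, the same two columns could be obtained with `POS.groupby('Pos').PTS.agg(['mean', 'median'])`.

```python
import pandas as pd

# Mean and median PTS per position, copied from the describe() output above.
summary = pd.DataFrame(
    {
        "mean": [7.617143, 8.146667, 8.474194, 6.758170, 8.036000],
        "median": [6.9, 6.7, 6.3, 5.8, 6.7],
    },
    index=pd.Index(["C", "PF", "PG", "SF", "SG"], name="Pos"),
)

by_mean = summary["mean"].sort_values(ascending=False)
by_median = summary["median"].sort_values(ascending=False)

print(by_mean.index[0])    # position with the highest mean
print(by_median.index[0])  # position with the highest median
```

By mean, point guards (PG) come out on top; by median, centers (C) do. That gap suggests a handful of high scorers pull the point-guard average up, which the histograms and box plots in the next sections can help check.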


### **Histograms**

We'll also try to answer this question with histograms. To make plotting easier, let's first create a subset dataframe that holds only the Pos and PTS columns.

#### **pandas built-in visualization**

#### **Seaborn data visualization**

In [None]:
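import seaborn as sns
import matplotlib.pyplot as plt

# Subset with only the columns needed for plotting
# (POS holds the five traditional positions selected above)
PTS = POS[['Pos', 'PTS']]

g = sns.FacetGrid(PTS, col="Pos")
g.map(plt.hist, "PTS");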

### **Box plots**

#### **Box plot of points scored (PTS) grouped by Position**

##### **pandas built-in visualization**

##### **Seaborn data visualization**

In [None]:
import seaborn as sns

# POS holds the five traditional positions selected above
sns.boxplot(x='Pos', y='PTS', data=POS)

In [None]:
# Box plot with the individual players overlaid as jittered points
sns.boxplot(x='Pos', y='PTS', data=POS)
sns.stripplot(x='Pos', y='PTS', data=POS,
              jitter=True,
              marker='o',
              alpha=0.8,
              color="black")

### **Heat map**

#### Compute the correlation matrix
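
A minimal sketch of this step, assuming the Pearson correlation that `DataFrame.corr` computes by default. It runs on five rows copied from the table at the top of the notebook rather than on the full `df`, so the numbers themselves mean little; `five` and `corr_sample` are illustrative names.

```python
import pandas as pd

five = pd.DataFrame(
    {
        "Player": ["Precious Achiuwa", "Steven Adams", "Bam Adebayo",
                   "Santi Aldama", "LaMarcus Aldridge"],
        "Age": [22, 28, 24, 21, 36],
        "G": [73, 76, 56, 32, 47],
        "MP": [23.6, 26.3, 32.6, 11.3, 22.3],
        "PTS": [9.1, 6.9, 19.1, 4.1, 12.9],
    }
)

# Keep the numeric columns only, then compute pairwise correlations.
corr_sample = five.select_dtypes(include=["number"]).corr()
print(corr_sample.round(2))
```

The result is symmetric with ones on the diagonal, which is why half of a correlation heat map can be masked without losing information.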

#### Make the heat map

#### Adjust figure size of heat map

In [None]:
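import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between the numeric columns
corr = df.select_dtypes(include=['number']).corr()

fig, ax = plt.subplots(figsize=(7,5))
sns.heatmap(corr, square=True)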

#### Mask diagonal half of heat map (Diagonal correlation matrix)

In [None]:
# https://seaborn.pydata.org/generated/seaborn.heatmap.html

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.select_dtypes(include=['number']).corr()

# Boolean mask that hides the upper triangle (including the diagonal)
mask = np.triu(np.ones_like(corr, dtype=bool))
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(7, 5))
    ax = sns.heatmap(corr, mask=mask, vmax=1, square=True)

### **Scatter Plot**

In [None]:
df

#### Select columns if they have numerical data types

#### Select the first 5 columns (by index number)

#### Select specific columns (by column names)

In [None]:
selections = ['Age', 'G', 'STL', 'BLK', 'AST', 'PTS']
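
The three selection styles named in the headings above can be sketched on a small frame. The rows are copied from this notebook's outputs and `mini` is an illustrative name; on the real data the same calls would be made on `df`.

```python
import pandas as pd

mini = pd.DataFrame(
    {
        "Player": ["Precious Achiuwa", "Steven Adams", "Bam Adebayo"],
        "Pos": ["C", "C", "C"],
        "Age": [22, 28, 24],
        "Tm": ["TOR", "MEM", "MIA"],
        "G": [73, 76, 56],
        "AST": [1.1, 3.4, 3.4],
        "STL": [0.5, 0.9, 1.4],
        "BLK": [0.6, 0.8, 0.8],
        "PTS": [9.1, 6.9, 19.1],
    }
)

numeric = mini.select_dtypes(include=["number"])  # by data type
first_five = mini.iloc[:, :5]                     # by position
selections = ["Age", "G", "STL", "BLK", "AST", "PTS"]
chosen = mini[selections]                         # by name

print(list(numeric.columns))
print(list(first_five.columns))
print(list(chosen.columns))
```

Selecting by name also controls the column order, which is handy when the subset is passed on to a plotting function.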

#### Make scatter plot grid

##### 5 columns

##### All columns