
# EDA on the entire career statistics of all NBA players who played in the 2021-22 season
In this Exploratory Data Analysis (EDA) exercise we work on the dataframe created during the previous web scraping assignment (link [here](https://github.com/GoAshim/WebScraping/blob/27939865ea33f8c5c1d43bd1941662de9908b8a3/Web_Scraping_3_NBA_Player_Stats_for_Entire_Career.ipynb)). We analyze the dataframe and extract important information and statistics covering the entire career of the players who played in the NBA 2021-22 season.

## Step 1 - Import required libraries

In [1]:
import pandas as pd 

## Step 2 - Load the dataset from the file and process the data for Exploratory Data Analysis (EDA)

### Step 2.1 - Extract the content of the CSV data file and load to a dataframe

In [19]:
# Location of the CSV file in my Google Drive
url = "/content/drive/MyDrive/DataFiles/NBAPlayerStat.csv"

df1 = pd.read_csv(url)

df1.head()

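|  | Unnamed: 0 | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | ... | C24 | C25 | C26 | C27 | C28 | C29 | C30 | C31 | C32 | C33 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Precious Achiuwa | Regular | 2020-21 | 21 | MIA | NBA | PF | 61 | ... | 0.509 | 1.2 | 2.2 | 3.4 | 0.5 | 0.3 | 0.5 | 0.7 | 1.5 | 5.0 |
| 1 | 1 | 1 | Precious Achiuwa | Regular | 2021-22 | 22 | TOR | NBA | C | 73 | ... | 0.595 | 2.0 | 4.5 | 6.5 | 1.1 | 0.5 | 0.6 | 1.2 | 2.1 | 9.1 |
| 2 | 2 | 1 | Precious Achiuwa | Playoff | 2020-21 | 21 | MIA | NBA | PF | 3 | ... | 0.25 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.7 | 1.3 | 0.3 | 2.3 |
| 3 | 3 | 1 | Precious Achiuwa | Playoff | 2021-22 | 22 | TOR | NBA | C | 6 | ... | 0.6 | 1.3 | 3.5 | 4.8 | 1.0 | 0.2 | 0.8 | 1.5 | 2.3 | 10.2 |
| 4 | 0 | 2 | Steven Adams | Regular | 2013-14 | 20 | OKC | NBA | C | 81 | ... | 0.581 | 1.8 | 2.3 | 4.1 | 0.5 | 0.5 | 0.7 | 0.9 | 2.5 | 3.3 |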


### Step 2.2 - Remove the first column from the dataframe as that isn't necessary

In [20]:
# We see that the index of the dataframe was extracted to the CSV file. Let's remove that column as we won't need that for our analysis

df1.drop(columns = ['Unnamed: 0'], inplace = True)
df1.columns

Index(['C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9', 'C10', 'C11',
       'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21',
       'C22', 'C23', 'C24', 'C25', 'C26', 'C27', 'C28', 'C29', 'C30', 'C31',
       'C32', 'C33'],
      dtype='object')
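
The `Unnamed: 0` column appears because the dataframe index was written to the CSV along with the data. As a minimal sketch, here are two ways to avoid it. A small in-memory CSV stands in for the real file, and the sample values are made up for illustration.

```python
import io
import pandas as pd

# Toy stand-in for the real CSV; the values are illustrative only
sample = pd.DataFrame({"C1": [1, 1], "C2": ["Player A", "Player A"]})

# Writing with the default index=True adds an unnamed leading column
csv_with_index = sample.to_csv()

# Option 1: tell read_csv that the first column holds the index
df_a = pd.read_csv(io.StringIO(csv_with_index), index_col=0)

# Option 2: do not write the index to the file in the first place
csv_without_index = sample.to_csv(index=False)
df_b = pd.read_csv(io.StringIO(csv_without_index))

print(list(df_a.columns))
print(list(df_b.columns))
```

Either option leaves only the data columns, so the separate `drop` call is not needed. Option 2 requires changing the notebook that produced the file.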

### Step 2.3 - Rename columns of the dataframe.

In [21]:
# We see that the column names C1 to C33 are not meaningful, so let's change those column names as they are described in the Basketball-reference site

# First create a dictionary with the old and the corresponding new column names
col_dict = {
    'C1' : 'player_id', 
    'C2' : 'player_name', 
    'C3' : 'season_type', 
    'C4' : 'season', 
    'C5' : 'age', 
    'C6' : 'team', 
    'C7' : 'league', 
    'C8' : 'position', 
    'C9' : 'games_played', 
    'C10' : 'games_started', 
    'C11' : 'minutes_per_game',
    'C12' : 'field_goals_per_game', 
    'C13' : 'field_goals_attempts_per_game', 
    'C14' : 'field_goal_percentage', 
    'C15' : '3points_per_game', 
    'C16' : '3points_attempts_per_game', 
    'C17' : '3points_percentage', 
    'C18' : '2points_per_game', 
    'C19' : '2points_attempts_per_game', 
    'C20' : '2points_percentage', 
    'C21' : 'effective_field_goal_percentage',
    'C22' : 'free_throw_per_game', 
    'C23' : 'free_throw_attempts_per_game', 
    'C24' : 'free_throw_percentage', 
    'C25' : 'offensive_rebound_per_game', 
    'C26' : 'defensive_rebound_per_game', 
    'C27' : 'total_rebound_per_game', 
    'C28' : 'assists_per_game', 
    'C29' : 'steals_per_game', 
    'C30' : 'blocks_per_game', 
    'C31' : 'turnover_per_game',
    'C32' : 'foul_per_game', 
    'C33' : 'points_per_game'
}

df1.rename(columns = col_dict, inplace = True)

df1.columns

Index(['player_id', 'player_name', 'season_type', 'season', 'age', 'team',
       'league', 'position', 'games_played', 'games_started',
       'minutes_per_game', 'field_goals_per_game',
       'field_goals_attempts_per_game', 'field_goal_percentage',
       '3points_per_game', '3points_attempts_per_game', '3points_percentage',
       '2points_per_game', '2points_attempts_per_game', '2points_percentage',
       'effective_field_goal_percentage', 'free_throw_per_game',
       'free_throw_attempts_per_game', 'free_throw_percentage',
       'offensive_rebound_per_game', 'defensive_rebound_per_game',
       'total_rebound_per_game', 'assists_per_game', 'steals_per_game',
       'blocks_per_game', 'turnover_per_game', 'foul_per_game',
       'points_per_game'],
      dtype='object')

### Step 2.4 - Remove all derived columns from the dataframe.

In [23]:
# We are going to remove the following derived columns, as we already have their underlying data
# 'field_goals_per_game', 'field_goals_attempts_per_game', 'field_goal_percentage', '3points_percentage', '2points_percentage', 
# 'effective_field_goal_percentage', 'free_throw_percentage', 'total_rebound_per_game'

# Method 1
drop_cols = [11, 12, 13, 16, 19, 20, 23, 26]
df1.drop(df1.columns[drop_cols], axis= 1, inplace= True)
df1.columns

# Method 2
#drop_cols = ['field_goals_per_game', 'field_goals_attempts_per_game', 'field_goal_percentage', '3points_percentage', '2points_percentage', 'effective_field_goal_percentage', 'free_throw_percentage', 'total_rebound_per_game']
#df1.drop(columns= drop_cols, inplace= True)
#df1.columns

# I prefer method 1 above because using the column indexes is easier than using the column names, which can be long

Index(['player_id', 'player_name', 'season_type', 'season', 'age', 'team',
       'league', 'position', 'games_played', 'games_started',
       'minutes_per_game', '3points_per_game', '3points_attempts_per_game',
       '2points_per_game', '2points_attempts_per_game', 'free_throw_per_game',
       'free_throw_attempts_per_game', 'offensive_rebound_per_game',
       'defensive_rebound_per_game', 'assists_per_game', 'steals_per_game',
       'blocks_per_game', 'turnover_per_game', 'foul_per_game',
       'points_per_game'],
      dtype='object')
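
Positional indexes are short to type, but they silently point at different columns if the column order ever changes. One possible middle ground, sketched here on a toy dataframe with made-up values, is to look up the positions from the names so that the index list can be checked rather than counted by hand.

```python
import pandas as pd

# Toy frame with a few of the renamed columns; the values are illustrative only
df = pd.DataFrame({
    "player_name": ["Player A"],
    "3points_per_game": [1.0],
    "3points_percentage": [0.35],
    "2points_per_game": [2.0],
    "2points_percentage": [0.50],
    "total_rebound_per_game": [3.4],
})

derived = ["3points_percentage", "2points_percentage", "total_rebound_per_game"]

# Look up each name's position instead of counting columns manually
positions = [df.columns.get_loc(name) for name in derived]
print(positions)

# Dropping by name raises a KeyError if a name is misspelled or missing
trimmed = df.drop(columns=derived)
print(list(trimmed.columns))
```

With name-based dropping, a typo fails loudly, whereas a wrong position quietly removes the wrong column.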

### Step 2.5 - Add new columns to the dataframe to calculate statistics per season based on the per game statistics.

In [27]:
# Call df1.info() beforehand to check that the existing columns used in the calculation have the right datatype

df1['minutes'] = df1['minutes_per_game'] * df1['games_played']
df1['3points'] = df1['3points_per_game'] * df1['games_played']
df1['3points_attempts'] = df1['3points_attempts_per_game'] * df1['games_played']
df1['2points'] = df1['2points_per_game'] * df1['games_played']
df1['2points_attempts'] = df1['2points_attempts_per_game'] * df1['games_played']
df1['free_throws'] = df1['free_throw_per_game'] * df1['games_played']
df1['free_throw_attempts'] = df1['free_throw_attempts_per_game'] * df1['games_played']
df1['offensive_rebound'] = df1['offensive_rebound_per_game'] * df1['games_played']
df1['defensive_rebound'] = df1['defensive_rebound_per_game'] * df1['games_played']
df1['assists'] = df1['assists_per_game'] * df1['games_played']
df1['steals'] = df1['steals_per_game'] * df1['games_played']
df1['blocks'] = df1['blocks_per_game'] * df1['games_played']
df1['turnovers'] = df1['turnover_per_game'] * df1['games_played']
df1['fouls'] = df1['foul_per_game'] * df1['games_played']
df1['points'] = df1['points_per_game'] * df1['games_played']

df1.columns

Index(['player_id', 'player_name', 'season_type', 'season', 'age', 'team',
       'league', 'position', 'games_played', 'games_started',
       'minutes_per_game', '3points_per_game', '3points_attempts_per_game',
       '2points_per_game', '2points_attempts_per_game', 'free_throw_per_game',
       'free_throw_attempts_per_game', 'offensive_rebound_per_game',
       'defensive_rebound_per_game', 'assists_per_game', 'steals_per_game',
       'blocks_per_game', 'turnover_per_game', 'foul_per_game',
       'points_per_game', 'minutes', '3points', '3points_attempts', '2points',
       '2points_attempts', 'free_throws', 'free_throw_attempts',
       'offensive_rebound', 'defensive_rebound', 'assists', 'steals', 'blocks',
       'turnovers', 'fouls', 'points'],
      dtype='object')
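
The fifteen assignments above follow one pattern, so they could also be written as a loop. The sketch below uses a toy dataframe and derives each new column name by stripping the `_per_game` suffix. That is an assumption that differs slightly from the names chosen above, for example it would give `free_throw` rather than `free_throws`. Because the per-game figures are rounded to one decimal, the products approximate the season totals rather than reproduce them exactly.

```python
import pandas as pd

# Toy frame; the two rows reuse per-game values shown in the earlier output
df = pd.DataFrame({
    "games_played": [61, 73],
    "assists_per_game": [0.5, 1.1],
    "points_per_game": [5.0, 9.1],
})

suffix = "_per_game"
per_game_cols = [c for c in df.columns if c.endswith(suffix)]

for col in per_game_cols:
    total_col = col[: -len(suffix)]  # e.g. 'assists_per_game' -> 'assists'
    # Round to one decimal to hide floating point noise in the product
    df[total_col] = (df[col] * df["games_played"]).round(1)

print(df[["assists", "points"]])
```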

### Step 2.6 - Remove all columns containing per game statistics.

In [30]:
# We use method 1 from earlier to drop the 15 per-game columns, from position 10 up to (but not including) position 25
 
df1.drop(df1.columns[10:25], axis= 1, inplace= True)
df1.columns

Index(['player_id', 'player_name', 'season_type', 'season', 'age', 'team',
       'league', 'position', 'games_played', 'games_started', 'minutes',
       '3points', '3points_attempts', '2points', '2points_attempts',
       'free_throws', 'free_throw_attempts', 'offensive_rebound',
       'defensive_rebound', 'assists', 'steals', 'blocks', 'turnovers',
       'fouls', 'points'],
      dtype='object')
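
The slice `10:25` depends on the per-game columns sitting in exactly those positions. As an alternative sketch, the same columns can be selected by their shared `_per_game` suffix, shown here on a toy dataframe with made-up values.

```python
import pandas as pd

# Toy frame mixing identifying, per-game and season-total columns
df = pd.DataFrame({
    "player_name": ["Player A"],
    "games_played": [61],
    "points_per_game": [5.0],
    "points": [305.0],
})

# Select every column whose name ends with the per-game suffix
per_game_cols = [c for c in df.columns if c.endswith("_per_game")]
df = df.drop(columns=per_game_cols)

print(list(df.columns))
```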

### Step 2.7 - Review the first few records in the dataframe to make sure everything looks alright. Then check which columns have null values and in how many records

In [34]:
#df1.info() # This shows there are 7637 records and 25 columns in the dataframe, and none of the columns has a null / NaN value in any record

df1.head()

|  | player_id | player_name | season_type | season | age | team | league | position | games_played | games_started | ... | free_throws | free_throw_attempts | offensive_rebound | defensive_rebound | assists | steals | blocks | turnovers | fouls | points |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Precious Achiuwa | Regular | 2020-21 | 21 | MIA | NBA | PF | 61 | 4 | ... | 54.9 | 109.8 | 73.2 | 134.2 | 30.5 | 18.3 | 30.5 | 42.7 | 91.5 | 305.0 |
| 1 | 1 | Precious Achiuwa | Regular | 2021-22 | 22 | TOR | NBA | C | 73 | 28 | ... | 80.3 | 131.4 | 146.0 | 328.5 | 80.3 | 36.5 | 43.8 | 87.6 | 153.3 | 664.3 |
| 2 | 1 | Precious Achiuwa | Playoff | 2020-21 | 21 | MIA | NBA | PF | 3 | 0 | ... | 0.9 | 3.9 | 0.0 | 6.0 | 0.0 | 0.0 | 2.1 | 3.9 | 0.9 | 6.9 |
| 3 | 1 | Precious Achiuwa | Playoff | 2021-22 | 22 | TOR | NBA | C | 6 | 1 | ... | 6.0 | 10.2 | 7.8 | 21.0 | 6.0 | 1.2 | 4.8 | 9.0 | 13.8 | 61.2 |
| 4 | 2 | Steven Adams | Regular | 2013-14 | 20 | OKC | NBA | C | 81 | 20 | ... | 81.0 | 137.7 | 145.8 | 186.3 | 40.5 | 40.5 | 56.7 | 72.9 | 202.5 | 267.3 |
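
Besides `df1.info()`, the null count per column can be computed directly. The sketch below uses a toy dataframe with one deliberately missing value, since the real data is reported to contain none.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value; the real dataframe reportedly has none
df = pd.DataFrame({
    "player_name": ["Player A", "Player B"],
    "games_played": [61, 73],
    "points": [305.0, np.nan],
})

# Number of null / NaN values in each column
null_counts = df.isna().sum()
print(null_counts)

# (rows, columns) of the dataframe
print(df.shape)
```

A column with a non-zero count here is one to inspect before computing any aggregates.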


### Step 9 - By printing the name of each player in the above block, we found that the code raises an error for the player name 'Thanasis Antetokounmpo'. Upon manually checking his individual career page we found that he did not play in certain years and hence has no data for those years. To accommodate that, let's modify the function we created to fetch the career statistics of each player.

### Step 10 - Repeat step 8 above to collect the entire career statistics of every player who played in the 2021-22 season and store them in a dataframe. Only this time we will call the updated function we created in step 9 above.

### Step 11 - The above code took about 4 minutes 30 seconds to collect the entire career statistics of each player who played in the 2021-22 season and store them in a dataframe. Now let's inspect the dataframe and then save it to a CSV file for future analysis.