# 1. Data Exploration

## 1. Import Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 2. Load The Data Set

In [None]:
df = pd.read_excel("data/NBA_DATA_2022-2023_2017-2018.xlsx")

## 3. View My Columns

In [None]:
df.columns

Column Description:
- Tm: Team
- G: Games Played
- GS: Games Started
- MP: Minutes Played
- FG: Field Goals Made (2 Pointers + 3 Pointers)
- FGA: Field Goals Attempted 
- FG%: Field Goal Percentage (FG/FGA)
- 3P: 3 Pointers Made 
- 3PA: 3 Pointers Attempted
- 3P%: 3 Pointer Percentage (3P/3PA)
- 2P: 2 Pointers Made 
- 2PA: 2 Pointers Attempted
- 2P%: 2 Pointer Percentage (2P/2PA)
- eFG%: Effective Field Goal Percentage (adjusts FG% for the fact that a made 3 pointer is worth one more point than a made 2 pointer)
- FT: Free Throws Made 
- FTA: Free Throws Attempted 
- FT%: Free Throw Percentage (FT/FTA)
- ORB: Rebounds While on Offense
- DRB: Rebounds While on Defense 
- TRB: Total Rebounds (ORB + DRB)
- AST: Assists
- STL: Steals 
- BLK: Blocks 
- TOV: Turnovers
- PF: Personal Fouls 
- PTS: Points Scored
- Time: 2022-23 or 2017-18
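
To make the eFG% entry concrete: the standard definition is (FG + 0.5 × 3P) / FGA, so each made 3 pointer counts as one and a half made field goals. Here is a minimal sketch with made-up numbers (the `toy` frame is hypothetical and not taken from the data set):

```python
import pandas as pd

# Hypothetical shooting lines, for illustration only
toy = pd.DataFrame({
    "FG":  [8, 8],    # field goals made
    "3P":  [0, 4],    # 3 pointers made (already included in FG)
    "FGA": [16, 16],  # field goals attempted
})

# Plain field goal percentage treats every make the same
toy["FG%"] = toy["FG"] / toy["FGA"]

# Effective field goal percentage gives each made 3 pointer extra weight
toy["eFG%"] = (toy["FG"] + 0.5 * toy["3P"]) / toy["FGA"]

print(toy)
```

Both rows have the same FG%, but the second row has the higher eFG% because half of its makes were 3 pointers.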




## 4. Check For Unique, Null and Duplicated Values

In [None]:
# Find the null values for each column
df.isna().sum()

# Find the unique values for these two columns, as they have null values I am looking to replace
df['birth_city'].unique()
df['birth_state'].unique()

# Check For Duplicates only for Players to make sure a player is not included twice
df.duplicated(['NAME']).sum()
df.loc[df.duplicated('NAME', keep=False)].sort_values(['Time'])


- The null values fall only under the percentage based statistics, college, birth city, and birth state. For the percentage based statistics, the likely cause is a zero denominator: if a player took no shots of a given type, the percentage (for example 3P/3PA) is undefined, so the value is left empty instead of being computed.

- Birth state and city are 5% null, but those values are easy to look up, so I can fill them in.

- College is over 10% null, so I will drop that column.

- I only checked for duplicates on NAME, since players can legitimately share the same stats or come from the same city.
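
One way to check the zero-attempts explanation and the null shares quoted above is sketched below. It runs on a made-up frame (the `toy` rows are hypothetical). The same two expressions could be applied to `df`, assuming its attempt columns are numeric:

```python
import pandas as pd

# Hypothetical rows: the second player attempted no 3 pointers
toy = pd.DataFrame({
    "NAME": ["Player A", "Player B"],
    "3P":   [2, 0],
    "3PA":  [5, 0],
})

# pandas does not raise on 0 / 0; it stores NaN instead
toy["3P%"] = toy["3P"] / toy["3PA"]

# Share of missing values per column, as a fraction between 0 and 1
null_share = toy.isna().mean()

# Rows where the percentage is missing and there were no attempts
no_attempts = toy[toy["3P%"].isna() & (toy["3PA"] == 0)]

print(null_share)
print(no_attempts)
```

If every row with a missing percentage also shows zero attempts, the zero-denominator explanation holds for that column.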

# 2. Data Cleaning 

## 1. Drop Unnecessary Columns, Query Data For Specific Range, and Drop Duplicates

In [None]:
# Drop the college column (the label passed to drop must match the name shown in df.columns exactly)
df = df.drop("collage", axis=1)
df.columns

# Query the data so I only keep the players in 2022-2023
df1 = df.query("Time == '2022-2023'")

# Drop every player whose name appears more than once (keep=False removes all copies), so no repeated names remain
df1 = df1.drop_duplicates('NAME', keep=False)

# Now that duplicates are dropped and the data is queried, check for null values again
df1.isna().sum()
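
Note that `keep=False` in `drop_duplicates` removes every row whose name is repeated; it does not keep one copy. A small sketch on a made-up frame (the `toy` data and team codes are hypothetical) shows the difference from `keep="first"`:

```python
import pandas as pd

# Hypothetical rows: "Player A" appears twice
toy = pd.DataFrame({
    "NAME": ["Player A", "Player A", "Player B"],
    "Tm":   ["AAA", "BBB", "CCC"],
})

# keep=False drops every row whose NAME is repeated
dropped_all = toy.drop_duplicates("NAME", keep=False)

# keep="first" keeps the first row for each NAME instead
kept_first = toy.drop_duplicates("NAME", keep="first")

print(dropped_all["NAME"].tolist())
print(kept_first["NAME"].tolist())
```

With `keep=False`, a player listed under more than one row disappears from the result entirely, which is worth keeping in mind when reading the row counts afterwards.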


- Removing the duplicates and the older season did me a huge favor, since most of the null values came from players in the older season, which we did not need.

- We want to be as relevant to today as possible, so I only kept players from last year's 2022-2023 season.

- Now I just need to replace five players' birth city and state with unknown, and replace the one player's three point percentage with 0.
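
The birth city and state replacement could be sketched as follows. This is a minimal example on a made-up frame (the `toy` rows are hypothetical, and the `"Unknown"` label is an assumed placeholder). The same `fillna` call should work on `df1`, assuming its columns are named `birth_city` and `birth_state` as above:

```python
import pandas as pd

# Hypothetical rows: the second player has no recorded birthplace
toy = pd.DataFrame({
    "NAME": ["Player A", "Player B"],
    "birth_city": ["Chicago", None],
    "birth_state": ["Illinois", None],
})

# Fill only the missing birthplace entries with a placeholder label
birth_cols = ["birth_city", "birth_state"]
toy[birth_cols] = toy[birth_cols].fillna("Unknown")

print(toy)
```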

In [154]:
# Replace the one missing three point percentage with 0; fillna only touches null entries
df1['3P%'] = df1['3P%'].fillna(0)

df1['3P%'].unique()

# 3. Exploratory Data Analysis

In [None]:
# Plot the cleaned 2022-2023 data
sns.scatterplot(data=df1, x='SALARY', y='FG')
plt.title("Salary vs FG")
plt.show()