# NBA Regular Season 2018-19 Data Challenge

Your task is to take the given dataset and produce an analysis that answers the following 10 questions. This project again tests your knowledge of pandas: each answer has to be derived from the data provided.

# What was the average age of players in the league?

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('files/nbastats2018-2019.csv')
df.columns

Index(['Name', 'Height', 'Weight', 'Team', 'Age', 'Salary', 'Points', 'Blocks',
       'Steals', 'Assists', 'Rebounds', 'FT%', 'FTA', 'FG3%', 'FG3A', 'FG%',
       'FGA', 'MP', 'G', 'PER', 'OWS', 'DWS', 'WS', 'WS48', 'USG', 'BPM',
       'VORP'],
      dtype='object')

In [36]:
avg_age = df['Age'].mean()

print(f"The average age of players in this dataset is: {round(avg_age,2)} years old.")


The average age of players in this dataset is: 25.9 years old.


# What player scored the most points?

In [29]:
#There are two ways to read this question.
#1. Most points as laid out in the dataset by default, which appears to be average points per game (PPG)
#In this case, James Harden has the highest points-per-game average at 36.1
highest_avg_points = df[['Name','Points']].sort_values('Points', ascending=False).head(1)
print(f"{highest_avg_points.iloc[0]['Name']} had the highest points per game average in this span at: {highest_avg_points.iloc[0]['Points']} per game.")

#2. While it is unlikely to differ, the highest season total could belong to someone else: a player who appeared in only 1 game and scored a ton of points would technically top approach #1
#To address this, I create a new column 'Total Points' equal to Points (the per-game average) * G (games played)
df['Total Points'] = df['Points'] * df['G']
#In the end, the same player had the highest total points for the season (James Harden)
highest_total_points = df[['Name', 'Total Points']].sort_values('Total Points', ascending=False).head(1)
print(f"The same player {highest_total_points.iloc[0]['Name']} had the highest total points for the season with: {highest_total_points.iloc[0]['Total Points']} (decimal value as it is determined from AVG Per Game x # of Games)")



James Harden had the highest points per game average in this span at: 36.1 per game.
The same player James Harden had the highest total points for the season with: 2815.8 (decimal value as it is determined from AVG Per Game x # of Games)


|     | Name         | Total Points |
|-----|--------------|--------------|
| 202 | James Harden | 2815.8       |
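As an aside, the same two lookups can be written without sorting. The sketch below uses made-up toy data (the player names and numbers are placeholders, not values from the dataset) and `idxmax`, which returns the index label of the largest value. It also shows how the per-game leader and the season-total leader can differ when someone plays very few games.

```python
import pandas as pd

# Toy data with the same column names as the dataset; all values are made up.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Points": [30.0, 12.5, 20.0],  # points per game
    "G": [2, 80, 70],              # games played
})

# Season total approximated as per-game average times games played.
toy["Total Points"] = toy["Points"] * toy["G"]

# idxmax gives the index label of the maximum, so no sort is needed.
top_avg = toy.loc[toy["Points"].idxmax(), "Name"]
top_total = toy.loc[toy["Total Points"].idxmax(), "Name"]

print(f"Highest per-game average: {top_avg}")
print(f"Highest season total: {top_total}")
```

In this toy frame the per-game leader played only 2 games, so the season-total leader is a different player.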


# What player had the most blocks during the season? Was it a post player (F/C)?

In [34]:
#Total blocks are estimated as blocks per game x games played
#The dataset has no column for position, so the F/C part of the question has to be answered from outside knowledge
#The top blocker was Myles Turner, who does indeed play in the post at C
df['Total Blocks'] = df['Blocks'] * df['G']
top_blocker = df[['Name','Blocks','Total Blocks']].sort_values('Total Blocks', ascending=False).head(1)

print(f"The player with the most blocks during the season was {top_blocker.iloc[0]['Name']}, who is a post player at the C position.")

The player with the most blocks during the season was Myles Turner, who is a post player at the C position.


# Based on the regular season, who had the best chance to win a title given their win percentage?

In [46]:
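df[['Name','G','WS48']].sort_values('WS48', ascending=False)
#The dataset has no team win percentage, so I went by the Win Shares statistic, specifically WS48 (win shares per 48 minutes)
#By that measure the player with the best chance to win a title was Zhou Qi, although he appeared in only 1 game, so the sample is tiny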


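|     | Name           | G   | WS48   |
|-----|----------------|-----|--------|
| 405 | Zhou Qi        | 1   | 1.261  |
| 147 | Trevon Duval   | 3   | 0.537  |
| 394 | Gary Payton II | 3   | 0.358  |
| 95  | Troy Caupain   | 4   | 0.347  |
| 501 | Alan Williams  | 5   | 0.312  |
| ... | ...            | ... | ...    |
| 498 | Okaro White    | 3   | -0.251 |
| 242 | Andre Ingram   | 4   | -0.408 |
| 14  | Ike Anigbogu   | 3   | -0.480 |
| 268 | Terrence Jones | 2   | -0.661 |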
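Both ends of this ranking are players with only a handful of games, which suggests WS48 is unstable on tiny samples. One possible refinement, sketched here on made-up toy data, is to apply a minimum-minutes filter before ranking. The 400-minute cutoff mirrors the rough floor used in other cells of this notebook and is an assumption, not an official threshold.

```python
import pandas as pd

# Toy data with the dataset's column names; all values are made up.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "MP": [1.0, 30.0, 25.0, 5.0],  # minutes per game
    "G": [1, 70, 60, 3],           # games played
    "WS48": [1.261, 0.210, 0.150, 0.537],
})

MIN_MINUTES = 400  # assumed cutoff, not an official qualification rule

# Season minutes approximated as minutes per game times games played.
toy["Total Minutes Played"] = toy["MP"] * toy["G"]

# Keep only players above the floor, then rank by WS48.
qualified = toy[toy["Total Minutes Played"] > MIN_MINUTES]
ranked = qualified.sort_values("WS48", ascending=False)
best_ws48 = ranked.iloc[0]["Name"]

print(f"Best WS48 among qualified players: {best_ws48}")
```

With the filter in place, the one-game outlier drops out and the ranking is decided among players with a meaningful number of minutes.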


# What player had the best 3-pt percentage? 

In [20]:
#A number of players have a 100% 3-pt percentage, but they are outliers with an extremely small sample size: either very few shots or very few minutes played
#As a result, I filtered them out and found the highest 3-pt percentage once the data was 'cleaned'
#For this filter, and in anticipation of the next question, I created a new column 'Total Minutes Played', which is games played * minutes per game
df['Total Minutes Played'] = df['MP'] * df['G']
df['Total Minutes Played'].mean()
#Then I applied a rough filter to drop the 100% shooters and anyone who played 400 minutes or fewer. This could be refined, but it felt like a good 'floor' given that the mean minutes played is 1139
best_3pt_player = df[(df['FG3%'] != 1.00) & (df['Total Minutes Played'] > 400)][['Name','FG3%']].sort_values('FG3%', ascending=False).head(1)
#The result: Domantas Sabonis is the top 3-pt percentage player
print(f"After filtering the data, I found the player with the best 3-pt percentage was {best_3pt_player.iloc[0]['Name']} who had a 3pt percentage of: {round(best_3pt_player.iloc[0]['FG3%']*100,2)}%")


After filtering the data, I found the player with the best 3-pt percentage was Domantas Sabonis who had a 3pt percentage of: 52.9%
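A minutes filter does not guarantee a meaningful number of shots, since a player can log many minutes while rarely attempting a three. A possible alternative, sketched below on made-up toy data, filters on attempts instead. It assumes `FG3A` is 3-point attempts per game (the dataset does not document this), and the `MIN_ATTEMPTS` cutoff of 100 is a hypothetical value chosen only for illustration.

```python
import pandas as pd

# Toy data with the dataset's column names; all values are made up.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "FG3%": [1.0, 0.529, 0.437, 0.400],
    "FG3A": [0.5, 0.2, 11.7, 6.0],  # assumed to be attempts per game
    "G": [2, 74, 69, 80],           # games played
})

MIN_ATTEMPTS = 100  # hypothetical cutoff for a meaningful sample

# Season attempts approximated as attempts per game times games played.
toy["Total 3PA"] = toy["FG3A"] * toy["G"]

# Rank only the players with enough attempts.
qualified = toy[toy["Total 3PA"] >= MIN_ATTEMPTS]
best = qualified.sort_values("FG3%", ascending=False).iloc[0]

print(f"{best['Name']} shot {round(best['FG3%'] * 100, 1)}% from three")
```

Filtering on attempts also removes the need for the separate `FG3% != 1.00` condition, because perfect percentages on one or two shots fall below the cutoff anyway.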


# Who played the most minutes during the season?

In [23]:
#This was fairly simple: reuse the 'Total Minutes Played' column from the last problem (recomputed here so the cell runs on its own), then sort the values
df['Total Minutes Played'] = df['MP'] * df['G']
highest_minute_player = df[['Name','MP','G','Total Minutes Played']].sort_values('Total Minutes Played', ascending=False).head(1)
print(f"The player that played the most minutes during the season was: {highest_minute_player.iloc[0]['Name']} who played a total of {round(highest_minute_player.iloc[0]['Total Minutes Played'])} minutes during the season.")


The player that played the most minutes during the season was: Bradley Beal who played a total of 3026 minutes during the season.


# What player, given their player efficiency rating, was the clutchest during the season?

In [37]:
#Like the earlier question, a filter is needed here: the raw data puts Zhou Qi at the top of the PER ranking, and he played 1 total minute this season
best_per_player = df[df['Total Minutes Played'] > 400][['Name', 'PER']].sort_values('PER', ascending=False).head(1)

print(f"After filtering the data, I found the most efficient player (via PER ranking) was {best_per_player.iloc[0]['Name']} who has a PER of: {best_per_player.iloc[0]['PER']}")

After filtering the data, I found the most efficient player (via PER ranking) was Giannis Antetokounmpo who has a PER of: 30.9


# What team had the youngest roster?

In [48]:
#The only difference from the other grouping/sorting is that we want the youngest (lowest) value, so we set 'ascending=True'
youngest_team = df[['Team','Age']].groupby('Team', as_index=False).mean().sort_values('Age', ascending=True).iloc[0]
print(f"The team with the youngest roster for this season was the {youngest_team['Team']} with an average age of {youngest_team['Age']} years.")

The team with the youngest roster for this season was the Chicago Bulls with an average age of 24.3125 years.


# Who was the highest paid player during the season?

In [70]:
#This one was tricky: the 'Salary' field is stored as a string (missing salaries appear as '-'), so at first the sorting returned incorrect results
#Drop the '-' rows and take an explicit copy, so the assignment below does not trigger pandas' SettingWithCopyWarning
salary_cleaned_df = df[df['Salary'] != '-'].copy()
#A simple conversion to integer via .astype(int) then makes the sort numeric
salary_cleaned_df['Salary'] = salary_cleaned_df['Salary'].astype(int)
richest_player = salary_cleaned_df[['Name','Salary']].sort_values('Salary', ascending=False).head(1)

print(f"The highest paid player during the season was {richest_player.iloc[0]['Name']} who earned ${richest_player.iloc[0]['Salary']}")

The highest paid player during the season was Stephen Curry who earned $37457154
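Another way to handle a string-typed salary column, sketched here on made-up toy data, is `pd.to_numeric` with `errors="coerce"`. It converts the whole column in one step and turns placeholders such as `'-'` into `NaN` instead of requiring them to be filtered out first. The names and salaries below are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy data: salaries stored as strings, with '-' marking a missing value.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Salary": ["1000000", "-", "25000000"],
})

# errors="coerce" replaces anything that cannot be parsed (here '-') with NaN.
toy["Salary"] = pd.to_numeric(toy["Salary"], errors="coerce")

# idxmax skips NaN by default, so missing salaries are ignored.
richest = toy.loc[toy["Salary"].idxmax()]

print(f"{richest['Name']} earned ${richest['Salary']:,.0f}")
```

Because the conversion assigns to a column of the original frame rather than to a filtered slice, it does not raise the copy-of-a-slice warning.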


# At the end of a game, who WOULDN'T you want on the Free Throw Line?

In [84]:
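#Applied the same filter to weed out people who only played VERY small amounts of time
#Note: this sorts descending, so it finds the BEST free throw shooter; the player you WOULDN'T want on the line is at the other end of the ranking (ascending=True)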
best_ft_shooter = df[df['Total Minutes Played'] > 400][['Name','FT%']].sort_values('FT%', ascending=False).head(1)
print(f"The best free throw shooter purely by percentage was {best_ft_shooter.iloc[0]['Name']} who had a FT% of: {best_ft_shooter.iloc[0]['FT%'] *100}%")

The best free throw shooter purely by percentage was Gary Clark who had a FT% of: 100.0%
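Since the question asks who you would not want on the line, the ranking arguably needs to run in the opposite direction, and a 100% result hints that some players who pass the minutes filter still have very few attempts. The sketch below, on made-up toy data, sorts ascending and adds a minimum-attempts filter. It assumes `FTA` is free throw attempts per game, and the cutoff of 50 total attempts is a hypothetical value.

```python
import pandas as pd

# Toy data with the dataset's column names; all values are made up.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "FT%": [1.0, 0.45, 0.88, 0.0],
    "FTA": [0.1, 5.0, 7.0, 0.2],    # assumed to be attempts per game
    "MP": [12.0, 30.0, 35.0, 2.0],  # minutes per game
    "G": [40, 75, 78, 3],           # games played
})

MIN_MINUTES = 400  # same rough floor used in other cells of this notebook
MIN_FTA = 50       # hypothetical cutoff for a meaningful sample

toy["Total Minutes Played"] = toy["MP"] * toy["G"]
toy["Total FTA"] = toy["FTA"] * toy["G"]

# Keep players with enough minutes AND enough attempts.
qualified = toy[(toy["Total Minutes Played"] > MIN_MINUTES) & (toy["Total FTA"] >= MIN_FTA)]

# ascending=True puts the lowest FT% first, i.e. the worst shooter.
worst = qualified.sort_values("FT%", ascending=True).iloc[0]

print(f"Worst qualified free throw shooter: {worst['Name']} at {worst['FT%'] * 100:.1f}%")
```

In the toy frame, the 100% shooter passes the minutes filter but is removed by the attempts filter, which is the situation the extra condition is meant to catch.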
