<a href="https://colab.research.google.com/github/DevEnriquegd/mvp-nba/blob/main/mvp_nba.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NBA Player Stats - Who's the Real MVP?**

You are a Data Analyst for a sports media company. You've been given a dataset covering NBA players across many seasons, with information on age, height, weight, draft position, per-game box-score stats (points, rebounds, assists), and advanced metrics (usage %, true shooting %, etc.).

Your task is to dig into this dataset to uncover trends in player performance, evaluate which metrics really define "greatness," and nominate a "Data MVP" for a given season or for all time.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

url = 'https://raw.githubusercontent.com/DevEnriquegd/mvp-nba/7d874f9284eb1609c8e2eb08c243619fc560012b/all_seasons.csv'

data = pd.read_csv(url, index_col=0)

data.head()

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,...,pts,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season
0,Randy Livingston,HOU,22.0,193.04,94.800728,Louisiana State,USA,1996,2,42,...,3.9,1.5,2.4,0.3,0.042,0.071,0.169,0.487,0.248,1996-97
1,Gaylon Nickerson,WAS,28.0,190.5,86.18248,Northwestern Oklahoma,USA,1994,2,34,...,3.8,1.3,0.3,8.9,0.03,0.111,0.174,0.497,0.043,1996-97
2,George Lynch,VAN,26.0,203.2,103.418976,North Carolina,USA,1993,1,12,...,8.3,6.4,1.9,-8.2,0.106,0.185,0.175,0.512,0.125,1996-97
3,George McCloud,LAL,30.0,203.2,102.0582,Florida State,USA,1989,1,7,...,10.2,2.8,1.7,-2.7,0.027,0.111,0.206,0.527,0.125,1996-97
4,George Zidek,DEN,23.0,213.36,119.748288,UCLA,USA,1995,1,22,...,2.8,1.7,0.3,-14.1,0.102,0.169,0.195,0.5,0.064,1996-97


In [2]:
# Data exploration - Basic Data

data[['player_name', 'age', 'player_height', 'player_weight', 'country', 'college']].sample(5)

Unnamed: 0,player_name,age,player_height,player_weight,country,college
2196,Samaki Walker,25.0,205.74,117.93392,USA,Louisville
9382,Andrew Harrison,22.0,198.12,96.615096,USA,Kentucky
8159,Reggie Evans,35.0,203.2,111.13004,USA,Iowa
4919,Darrell Armstrong,40.0,185.42,81.64656,USA,Fayetteville State
3404,Luke Walton,24.0,203.2,106.59412,USA,Arizona


In [3]:
# Data exploration - Team and Draft

data[['player_name', 'team_abbreviation', 'season', 'draft_year', 'draft_round', 'draft_number']].sample(5)

Unnamed: 0,player_name,team_abbreviation,season,draft_year,draft_round,draft_number
2670,Mike Miller,MEM,2002-03,2000,1,5
5297,Pat Garrity,ORL,2007-08,1998,1,19
5596,Trenton Hassell,NJN,2008-09,2001,2,29
815,Jeff Hornacek,UTA,1997-98,1986,2,46
1557,Cedric Ceballos,DAL,1999-00,1990,2,48


In [4]:
# Data exploration - Games Played and Per-Game Averages per Season

data[['player_name', 'season', 'gp', 'pts', 'reb', 'ast']].sample(5)

Unnamed: 0,player_name,season,gp,pts,reb,ast
8222,Matthew Dellavedova,2014-15,67,4.8,1.9,3.0
2872,Bryon Russell,2002-03,70,4.5,3.0,1.0
8783,D.J. Augustin,2015-16,62,7.5,1.5,3.2
1828,Hubert Davis,2000-01,66,7.9,2.1,1.7
3578,Kerry Kittles,2004-05,11,6.3,2.9,1.8


In [5]:
# Data exploration - Advanced Metrics per Season

data[['player_name', 'season', 'net_rating', 'oreb_pct', 'dreb_pct', 'usg_pct', 'ts_pct', 'ast_pct']].sample(5)

Unnamed: 0,player_name,season,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct
8174,Marcin Gortat,2014-15,6.0,0.084,0.237,0.175,0.587,0.062
5462,Darrell Arthur,2008-09,-7.3,0.088,0.203,0.163,0.456,0.049
4313,Aaron McKie,2005-06,-6.2,0.03,0.168,0.06,0.272,0.147
5490,Grant Hill,2008-09,4.4,0.033,0.155,0.176,0.584,0.107
10025,Jerian Grant,2017-18,-7.1,0.017,0.084,0.172,0.528,0.314


In [6]:
# Data cleaning

# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection, which pandas warns may operate on a copy.
data['college'] = data['college'].fillna('No College')

data.isnull().sum()


Unnamed: 0,0
player_name,0
team_abbreviation,0
age,0
player_height,0
player_weight,0
college,0
country,0
draft_year,0
draft_round,0
draft_number,0


## **Which players lead their seasons in scoring, rebounding, and playmaking - and how efficient are they?**

In [7]:
def find_leader_by_statistic(df, statistic):
    """
    Find the leader in a given statistic for each season in a DataFrame.
    """
    idx = df.groupby('season')[statistic].idxmax()
    return df.loc[idx]

report_cols = ['player_name', 'season', 'pts', 'reb', 'ast', 'net_rating', 'usg_pct', 'ts_pct', 'ast_pct']
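
One caveat with the `idxmax` approach above: because `pts`, `reb`, and `ast` are per-game averages, a player with only a handful of games could in principle top a season. The sketch below shows one possible safeguard, a minimum games-played filter. The function name `find_leader_min_games`, the `min_gp` parameter, and the toy numbers are all hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy data with made-up numbers, only to illustrate the mechanics.
toy = pd.DataFrame({
    'player_name': ['Player A', 'Player B', 'Player C', 'Player D'],
    'season': ['1996-97', '1996-97', '1997-98', '1997-98'],
    'gp': [80, 3, 75, 70],
    'pts': [25.0, 31.0, 30.0, 12.0],
})

def find_leader_min_games(df, statistic, min_gp=0):
    """Per-season leader in `statistic`, ignoring rows with fewer than `min_gp` games."""
    eligible = df[df['gp'] >= min_gp]
    # idxmax returns the index label of the first maximum within each season.
    idx = eligible.groupby('season')[statistic].idxmax()
    return eligible.loc[idx]

unfiltered = find_leader_min_games(toy, 'pts')
filtered = find_leader_min_games(toy, 'pts', min_gp=40)

print(unfiltered[['player_name', 'season', 'gp', 'pts']])
print(filtered[['player_name', 'season', 'gp', 'pts']])
```

With `min_gp=0` the function behaves like `find_leader_by_statistic`. Any specific cutoff is a judgment call rather than something the dataset prescribes.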

In [8]:
leader_pts = find_leader_by_statistic(data, 'pts')

report_pts = leader_pts[report_cols].rename(columns={
    'pts': 'Leader_PTS',
    'reb': 'Reb_Context',
    'ast': 'Ast_Context'
})

efficient_scorers_report = report_pts.sort_values('ts_pct', ascending=False)

TS_ELITE_THRESHOLD = 0.60
TS_RISK_THRESHOLD = 0.53

def highlight_efficiency(val):
    """Assigns color based on the TS% value."""
    if isinstance(val, (int, float)):
        if val > TS_ELITE_THRESHOLD:
            return 'background-color: lightgreen'
        elif val < TS_RISK_THRESHOLD:
            return 'background-color: lightcoral'
    return ''

summary_cols = ['player_name', 'season', 'Leader_PTS', 'usg_pct', 'ts_pct', 'net_rating']

styled_report = efficient_scorers_report[summary_cols].style.format({
    'usg_pct': '{:.1%}',
    'ts_pct': '{:.1%}',
    'net_rating': '{:+.1f}'
}).map(
    highlight_efficiency,
    subset=['ts_pct']
)

print("🏀 SCORING LEADERS: RANKED BY TRUE SHOOTING EFFICIENCY (TS%)")

display(styled_report)

🏀 SCORING LEADERS: RANKED BY TRUE SHOOTING EFFICIENCY (TS%)


Unnamed: 0,player_name,season,Leader_PTS,usg_pct,ts_pct,net_rating
8930,Stephen Curry,2015-16,30.1,32.0%,66.9%,18.3
11537,Stephen Curry,2020-21,32.0,33.1%,65.5%,4.6
12839,Joel Embiid,2022-23,33.1,37.0%,65.5%,8.8
8013,Kevin Durant,2013-14,32.0,32.7%,63.5%,8.0
10634,James Harden,2019-20,34.3,35.6%,62.6%,5.8
9996,James Harden,2017-18,30.4,35.3%,61.9%,10.0
10227,James Harden,2018-19,36.1,39.6%,61.6%,6.3
12203,Joel Embiid,2021-22,30.6,37.5%,61.6%,7.9
6786,Kevin Durant,2011-12,28.0,30.8%,61.0%,7.7
6183,Kevin Durant,2009-10,30.1,31.7%,60.7%,7.0
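
For readers unfamiliar with the metric, true shooting percentage is commonly defined as PTS / (2 × (FGA + 0.44 × FTA)), which folds free throws and three-pointers into a single efficiency number. The dataset ships `ts_pct` precomputed, and the columns shown in this notebook do not include shot attempts, so the sketch below uses made-up inputs purely to illustrate the arithmetic and how the notebook's 0.60 and 0.53 thresholds would label a value. The helper names are hypothetical.

```python
TS_ELITE_THRESHOLD = 0.60  # same thresholds as the notebook cell above
TS_RISK_THRESHOLD = 0.53

def true_shooting_pct(pts, fga, fta):
    """Common TS% definition: points per two 'true' shooting attempts."""
    denominator = 2 * (fga + 0.44 * fta)
    if denominator == 0:
        return float('nan')  # no attempts, efficiency is undefined
    return pts / denominator

def efficiency_label(ts):
    """Mirror the highlight rules: elite above 0.60, risk below 0.53."""
    if ts > TS_ELITE_THRESHOLD:
        return 'elite'
    if ts < TS_RISK_THRESHOLD:
        return 'risk'
    return 'neutral'

# Hypothetical per-game line: 30 points on 20 field-goal and 8 free-throw attempts.
example_ts = true_shooting_pct(pts=30.0, fga=20.0, fta=8.0)
print(f"{example_ts:.3f} -> {efficiency_label(example_ts)}")
```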


In [9]:
leader_reb = find_leader_by_statistic(data, 'reb')

report_reb = leader_reb[report_cols].rename(columns={
    'reb': 'Leader_REB',
    'pts': 'Pts_Context',
    'ast': 'Ast_Context'
})

impact_rebounders_report = report_reb.sort_values('net_rating', ascending=False)
NET_RATING_ELITE_THRESHOLD = 8.0
NET_RATING_RISK_THRESHOLD = 0.0

def highlight_impact(val):
    """Assigns color based on the NET_RATING value."""
    if isinstance(val, (int, float)):
        if val >= NET_RATING_ELITE_THRESHOLD:
            return 'background-color: lightgreen'
        elif val < NET_RATING_RISK_THRESHOLD:
            return 'background-color: lightcoral'
    return ''

summary_cols_reb = ['player_name', 'season', 'Leader_REB', 'net_rating', 'usg_pct', 'ts_pct']

styled_report_reb = impact_rebounders_report[summary_cols_reb].style.format({
    'Leader_REB': '{:.1f}',
    'usg_pct': '{:.1%}',
    'ts_pct': '{:.1%}',
    'net_rating': '{:+.1f}'
}).map(
    highlight_impact,
    subset=['net_rating']
)

print("🧱 REBOUNDING LEADERS: RANKED BY NET RATING (Winning Impact)")

display(styled_report_reb)

🧱 REBOUNDING LEADERS: RANKED BY NET RATING (Winning Impact)


Unnamed: 0,player_name,season,Leader_REB,net_rating,usg_pct,ts_pct
188,Dennis Rodman,1996-97,16.1,16.1,10.0%,47.9%
5971,Dwight Howard,2009-10,13.2,11.8,24.1%,63.0%
8446,DeAndre Jordan,2014-15,15.0,11.2,13.5%,63.8%
5445,Dwight Howard,2008-09,13.8,10.6,26.0%,60.0%
3434,Kevin Garnett,2003-04,13.9,10.4,29.4%,54.7%
12212,Rudy Gobert,2021-22,14.7,9.6,16.6%,73.2%
7824,DeAndre Jordan,2013-14,13.6,9.2,12.3%,63.0%
4918,Dwight Howard,2007-08,14.2,8.1,24.0%,61.9%
760,Dennis Rodman,1997-98,15.0,6.7,8.8%,45.9%
11411,Clint Capela,2020-21,14.3,6.6,19.3%,60.1%


In [10]:
leader_ast = find_leader_by_statistic(data, 'ast')

report_ast = leader_ast[report_cols].rename(columns={
    'ast': 'Leader_AST',
    'pts': 'Pts_Context',
    'reb': 'Reb_Context'
})

creation_leaders_report = report_ast.sort_values('ast_pct', ascending=False)

AST_ELITE_THRESHOLD = 0.45
AST_RISK_THRESHOLD = 0.30

def highlight_creation(val):
    """Assigns color based on the AST% value."""
    if isinstance(val, (int, float)):
        if val >= AST_ELITE_THRESHOLD:
            return 'background-color: lightgreen'
        elif val < AST_RISK_THRESHOLD:
            return 'background-color: lightcoral'
    return ''

summary_cols_ast = ['player_name', 'season', 'Leader_AST', 'ast_pct', 'usg_pct', 'Pts_Context', 'net_rating']

styled_report_ast = creation_leaders_report[summary_cols_ast].style.format({
    'Leader_AST': '{:.1f}',
    'ast_pct': '{:.1%}',
    'usg_pct': '{:.1%}',
    'Pts_Context': '{:.1f}',
    'net_rating': '{:+.1f}'
}).map(
    highlight_creation,
    subset=['ast_pct']
)

print("🧠 ASSIST LEADERS: RANKED BY ASSIST PERCENTAGE (Playmaking Volume)")

display(styled_report_ast)

🧠 ASSIST LEADERS: RANKED BY ASSIST PERCENTAGE (Playmaking Volume)


Unnamed: 0,player_name,season,Leader_AST,ast_pct,usg_pct,Pts_Context,net_rating
5732,Chris Paul,2008-09,11.0,51.2%,27.3%,22.8,6.5
9457,James Harden,2016-17,11.2,50.5%,34.1%,29.1,6.3
5060,Chris Paul,2007-08,11.6,50.0%,25.4%,21.1,7.9
6445,Steve Nash,2010-11,11.4,49.8%,21.2%,14.7,4.5
7062,Rajon Rondo,2011-12,11.7,49.8%,20.3%,11.9,5.2
7227,Rajon Rondo,2012-13,11.1,49.0%,21.5%,13.7,-1.3
2600,Andre Miller,2001-02,10.9,48.4%,22.7%,16.5,-1.1
5872,Steve Nash,2009-10,11.0,48.3%,22.8%,16.5,7.1
10958,LeBron James,2019-20,10.2,47.7%,30.8%,25.3,8.5
11555,Russell Westbrook,2020-21,11.7,47.7%,29.5%,22.2,-1.2


## **How do players from different eras (1990s, 2000s, 2010s, 2020s) compare in size, style, and performance?**

In [12]:
def assign_era(season):
    """
    Assigns a historical era (decade) based on the season string.
    Example: '1996-97' -> '1990s'
    """
    start_year = int(season[:4])

    if 1990 <= start_year <= 1999:
        return '1990s'
    elif 2000 <= start_year <= 2009:
        return '2000s'
    elif 2010 <= start_year <= 2019:
        return '2010s'
    elif 2020 <= start_year <= 2029:
        return '2020s'
    else:
        return 'Other'
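
As a quick sanity check on the decade logic, the same mapping can be sketched with integer arithmetic instead of an `if`/`elif` chain. This is an alternative formulation, not the notebook's method: the name `assign_era_by_decade` is hypothetical, and unlike `assign_era` it has no `'Other'` bucket, so any decade maps to its own label.

```python
import pandas as pd

def assign_era_by_decade(season):
    """Map a season string such as '1996-97' to its starting decade, e.g. '1990s'."""
    start_year = int(season[:4])
    # Floor the start year to the decade: 1996 -> 1990, 2009 -> 2000.
    return f"{start_year // 10 * 10}s"

# Boundary cases: a season is assigned by the year in which it starts.
seasons = pd.Series(['1996-97', '1999-00', '2009-10', '2022-23'])
eras = seasons.apply(assign_era_by_decade)
print(eras.tolist())
```

Note that a season straddling two decades, such as `'1999-00'`, lands in the earlier one under both versions.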

In [17]:
data['era'] = data['season'].apply(assign_era)

data.head()

Unnamed: 0,player_name,team_abbreviation,age,player_height,player_weight,college,country,draft_year,draft_round,draft_number,...,reb,ast,net_rating,oreb_pct,dreb_pct,usg_pct,ts_pct,ast_pct,season,era
0,Randy Livingston,HOU,22.0,193.04,94.800728,Louisiana State,USA,1996,2,42,...,1.5,2.4,0.3,0.042,0.071,0.169,0.487,0.248,1996-97,1990s
1,Gaylon Nickerson,WAS,28.0,190.5,86.18248,Northwestern Oklahoma,USA,1994,2,34,...,1.3,0.3,8.9,0.03,0.111,0.174,0.497,0.043,1996-97,1990s
2,George Lynch,VAN,26.0,203.2,103.418976,North Carolina,USA,1993,1,12,...,6.4,1.9,-8.2,0.106,0.185,0.175,0.512,0.125,1996-97,1990s
3,George McCloud,LAL,30.0,203.2,102.0582,Florida State,USA,1989,1,7,...,2.8,1.7,-2.7,0.027,0.111,0.206,0.527,0.125,1996-97,1990s
4,George Zidek,DEN,23.0,213.36,119.748288,UCLA,USA,1995,1,22,...,1.7,0.3,-14.1,0.102,0.169,0.195,0.5,0.064,1996-97,1990s


In [20]:
analysis_metrics = {
    'season': 'count',
    'player_height': 'mean',
    'player_weight': 'mean',
    'pts': 'mean',
    'reb': 'mean',
    'ast': 'mean',
    'net_rating': 'mean',
    'usg_pct': 'mean',
    'ts_pct': 'mean',
    'ast_pct': 'mean'
}

era_comparison = data.groupby('era').agg(analysis_metrics)

era_comparison = era_comparison.rename(columns={'season': 'Player_Count'})

print("\n--- 📈 Comparison of Averages by Era ---")
display(era_comparison.sort_values(by='era'))


--- 📈 Comparison of Averages by Era ---


era,Player_Count,player_height,player_weight,pts,reb,ast,net_rating,usg_pct,ts_pct,ast_pct
1990s,1757,200.859693,100.541648,7.829653,3.534035,1.786511,-2.340751,0.188328,0.494335,0.133679
2000s,4469,201.037467,101.3628,8.102506,3.584247,1.782725,-2.149094,0.186365,0.502954,0.1299
2010s,4934,200.600422,100.014506,8.266133,3.550831,1.826429,-2.075598,0.184018,0.519172,0.131329
2020s,1684,198.824382,97.783823,8.747328,3.538064,1.970724,-2.753622,0.178043,0.542104,0.134698
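
The raw means are easier to compare as relative changes against a baseline era. The sketch below re-enters four columns from the table above by hand and expresses each era as a percentage change versus the 1990s. The variable names are hypothetical. Keep in mind that the first season visible in the data is 1996-97 and the most recent one shown is 2022-23, so the 1990s and 2020s buckets appear to cover only part of their decades.

```python
import pandas as pd

# Values copied from the era comparison table above.
era_means = pd.DataFrame(
    {
        'player_height': [200.859693, 201.037467, 200.600422, 198.824382],
        'player_weight': [100.541648, 101.3628, 100.014506, 97.783823],
        'pts': [7.829653, 8.102506, 8.266133, 8.747328],
        'ts_pct': [0.494335, 0.502954, 0.519172, 0.542104],
    },
    index=pd.Index(['1990s', '2000s', '2010s', '2020s'], name='era'),
)

# Percentage change of each era's mean relative to the 1990s row.
baseline = era_means.loc['1990s']
pct_change_vs_1990s = (era_means / baseline - 1) * 100

print(pct_change_vs_1990s.round(2))
```

Read this way, the table suggests that average true shooting rose by roughly a tenth between the 1990s and 2020s buckets, while average height and weight dipped slightly.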


## **Which teams, positions, or player types consistently produce top performers?**
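
This question is left open in the notebook. One possible starting point for the team part is sketched below: flag players at or above a per-season quantile of some statistic, then count how often each team appears. The function name `count_top_performers`, the 0.9 quantile, and the toy rows are all assumptions for illustration. The columns shown in this notebook do not include a position field, so the position part of the question would need a proxy or an external source.

```python
import pandas as pd

# Toy data with made-up teams and numbers, only to illustrate the mechanics.
toy = pd.DataFrame({
    'player_name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'team_abbreviation': ['AAA', 'AAA', 'BBB', 'BBB', 'AAA', 'CCC'],
    'season': ['2000-01'] * 3 + ['2001-02'] * 3,
    'pts': [28.0, 10.0, 12.0, 25.0, 27.0, 8.0],
})

def count_top_performers(df, statistic, quantile=0.9):
    """Count, per team, the player-seasons at or above the per-season quantile."""
    # Broadcast each season's cutoff back onto its rows.
    cutoff = df.groupby('season')[statistic].transform(lambda s: s.quantile(quantile))
    top = df[df[statistic] >= cutoff]
    return top.groupby('team_abbreviation').size().sort_values(ascending=False)

top_by_team = count_top_performers(toy, 'pts')
print(top_by_team)
```

On the real data, the same call could be run as `count_top_performers(data, 'pts')` or with `'net_rating'`, keeping in mind that a player's row is tied to the team listed for that season.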

## **Based on the data, who deserves the MVP crown - and how does your pick compare to the official NBA MVP?**
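
This question is also left open. A minimal sketch of one way to nominate a "Data MVP" follows: standardize several metrics within each season and average the z-scores into a single score. The choice of metrics, the equal weighting, the names `add_mvp_score` and `mvp_score`, and the toy rows are all assumptions, not the notebook's method. Official MVP winners are not part of the columns shown here, so the comparison would need an external list.

```python
import pandas as pd

# Toy data with made-up numbers, only to illustrate the mechanics.
toy = pd.DataFrame({
    'player_name': ['Player A', 'Player B', 'Player C'],
    'season': ['2000-01'] * 3,
    'pts': [30.0, 20.0, 10.0],
    'reb': [5.0, 10.0, 5.0],
    'ast': [6.0, 4.0, 2.0],
    'net_rating': [8.0, 2.0, -4.0],
    'ts_pct': [0.60, 0.55, 0.50],
})

METRICS = ['pts', 'reb', 'ast', 'net_rating', 'ts_pct']  # assumed, equally weighted

def add_mvp_score(df, metrics=METRICS):
    """Add an 'mvp_score' column: the mean within-season z-score over `metrics`."""
    scored = df.copy()
    grouped = scored.groupby('season')[metrics]
    # z-score per season; a season with one player or zero spread yields NaN or inf.
    z = (scored[metrics] - grouped.transform('mean')) / grouped.transform('std')
    scored['mvp_score'] = z.mean(axis=1)
    return scored

scored = add_mvp_score(toy)
data_mvp = scored.loc[scored.groupby('season')['mvp_score'].idxmax()]
print(data_mvp[['player_name', 'season', 'mvp_score']])
```

Because per-game averages and net rating can be noisy for low-minute players, a games-played filter before scoring would likely be sensible on the real data.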