<h1>NBA Dataset analysis: Team & Context data</h1>

This notebook gives you insight into the Players, Teams and Context data.

We will create a few visualizations and look for relationships among the values.

In [None]:
#Loading libraries 

In [None]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import os 


pd.set_option("display.max_columns",50)
pd.set_option("display.precision",3)

In [None]:
data_path = "/kaggle/input/nba-analytics-dataset-players-teams-and-context"

players = pd.read_csv(f"{data_path}/players_master.csv")
teams = pd.read_csv(f"{data_path}/teams_master.csv")
context = pd.read_csv(f"{data_path}/context_master.csv")

print("Files loaded successfully!")

In [None]:
players.shape,teams.shape,context.shape

In [None]:
# Display the first rows of each dataset

print("Players:")
display(players.head())

print("\nTeams:")
display(teams.head())

print("\nContext:")
display(context.head())

In [None]:
# Check structure and missing values for each dataset

def quick_info(df,name):
    print(f"==={name}===")
    print("Shape:",df.shape)
    print("\nInfo:")
    df.info()
    print("\nMissing values:")
    display(df.isnull().sum().sort_values(ascending= False).head(15))
    print("\n"+"="*60+"\n")

quick_info(players,"Players")
quick_info(teams,"Teams")
quick_info(context,"Context")

<h3>Player dataset analysis</h3>

We now look at player performance: points scored, minutes played, and the resulting ranking by points.

In [None]:
top_players = (
    players.sort_values("BASE_PTS", ascending = False).loc[:,["PLAYER_NAME","BASE_TEAM_ABBREVIATION","BASE_GP","BASE_MIN","BASE_PTS"]]
    .head(15)
)

top_players.reset_index(drop = True,inplace = True)
top_players

<h3>Plotting BASE_PTS for the top 15 players</h3>

In [None]:
plt.figure(figsize=(12,6))

sns.barplot(
    data=top_players,
    x="PLAYER_NAME",
    y="BASE_PTS",
    hue="BASE_TEAM_ABBREVIATION",
    dodge = False
)

plt.title("Top 15 players by BASE_PTS")
plt.xlabel("Players")
plt.ylabel("Base points")
plt.xticks(rotation = 45, ha ="right")
plt.legend(title="Team",bbox_to_anchor=(1.05,1))
plt.show()

<h3>Minutes played versus points scored</h3>

In [None]:
plt.figure(figsize=(8,6))

sns.scatterplot(
    data=players,
    x="BASE_MIN",
    y="BASE_PTS",
    alpha =0.6
)

plt.title("Minutes vs points")
plt.xlabel("BASE_MIN")
plt.ylabel("BASE_PTS")
plt.tight_layout()
plt.show()


<h3>Team dataset analysis</h3>

We now look for patterns in win percentage, offensive rating, defensive rating, and net rating.

In [None]:
teams_data=(
    teams.loc[:,["TEAM_NAME","BASE_GP","BASE_W","BASE_L","BASE_W_PCT"]]
    .sort_values("BASE_W_PCT",ascending = False)
)

teams_data.head(10)

<h3>Plotting win percentage for each team</h3>

In [None]:
plt.figure(figsize=(12,6))

sns.barplot(
    data = teams_data,
    x="TEAM_NAME",
    y="BASE_W_PCT",
)

plt.title("Team win percentage vs Team name")
plt.xlabel("Team names")
plt.ylabel("Win %")
plt.xticks(rotation =45, ha="right")
plt.tight_layout()
plt.show()

The plot shows that the Oklahoma City Thunder has the highest win rate of all teams, at roughly 95% or more.
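A win rate this high is worth a quick sanity check against the raw win and game counts. The sketch below recomputes the win rate from `BASE_W` and `BASE_GP` and flags any row where it disagrees with the stored `BASE_W_PCT`. It runs on a small toy table with made-up values, and it assumes that `BASE_W_PCT` is stored as a fraction between 0 and 1; the names `toy_teams`, `W_PCT_CHECK` and `MISMATCH` are illustrative only.

```python
import pandas as pd

# Toy values for illustration only; they are NOT taken from teams_master.csv.
toy_teams = pd.DataFrame({
    "TEAM_NAME": ["Team A", "Team B", "Team C"],
    "BASE_GP": [20, 20, 20],
    "BASE_W": [19, 10, 5],
    "BASE_W_PCT": [0.95, 0.50, 0.25],  # assumed to be a fraction, not a percent
})

# Recompute the win rate from wins and games played.
toy_teams["W_PCT_CHECK"] = toy_teams["BASE_W"] / toy_teams["BASE_GP"]

# Flag rows where the stored and recomputed values differ by more than rounding.
toy_teams["MISMATCH"] = (
    (toy_teams["BASE_W_PCT"] - toy_teams["W_PCT_CHECK"]).abs() > 0.005
)

print(toy_teams)
```

To apply the same check to the real data, the toy table could be swapped for `teams`, provided the column semantics match the assumption above.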

<h3>Now let's analyze teams' offensive, defensive and net ratings, with a special focus on the Oklahoma City Thunder (the team with the most wins)</h3>

In [None]:
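rating_cols=["ADV_OFF_RATING","ADV_DEF_RATING","ADV_NET_RATING"]

# Warn about any rating column that is missing from the teams table
for col in rating_cols:
    if col not in teams.columns:
        print(f"Column {col} not found")

teams_data = teams[["TEAM_NAME"] + rating_cols].copy()

# Row for the highlighted team (the exact TEAM_NAME spelling is assumed)
okc_stats = teams_data[teams_data["TEAM_NAME"] == "Oklahoma City Thunder"]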

plt.figure(figsize=(8,6))

sns.scatterplot(
    data = teams_data,
    x="ADV_OFF_RATING",
    y="ADV_DEF_RATING",
    size="ADV_NET_RATING",
    sizes = (50,250)
)

sns.scatterplot(
    data=okc_stats, 
    x='ADV_OFF_RATING', 
    y='ADV_DEF_RATING', 
    color='orange', 
    s=250, 
    label='Oklahoma City Thunder',
    edgecolor='black'
)

plt.title("Offensive vs defensive rating (bubble size = net rating)")
plt.xlabel("Offensive Rating (higher = better offense)")
plt.ylabel("Defensive Rating (lower = better defense)")
plt.tight_layout()
plt.show()


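The graph shows that the team with the most wins (Oklahoma City Thunder) sits at roughly (\~120, \~105): a high offensive rating combined with a low defensive rating. Because defensive rating counts points allowed per 100 possessions, a lower value is better, so this position points to a team that is strong at both ends rather than one that favors offense over defense. Most other teams cluster around (\~112.5, \~115), where the two ratings are close to each other, and some teams allow more points than they score.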
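The coordinates quoted above are easier to compare through net rating, which is conventionally offensive rating minus defensive rating (points scored minus points allowed per 100 possessions). The sketch below applies that convention to three toy rows that mirror the approximate positions read off the plot. The team labels and the third row are made up, the names `toy_ratings`, `NET_CHECK` and `PROFILE` are illustrative, and it is an assumption that `ADV_NET_RATING` in the dataset follows the same definition.

```python
import pandas as pd

# Toy rows for illustration; values are approximate or invented, not dataset rows.
toy_ratings = pd.DataFrame({
    "TEAM_NAME": ["Team A", "Team B", "Team C"],
    "ADV_OFF_RATING": [120.0, 112.5, 108.0],  # points scored per 100 possessions
    "ADV_DEF_RATING": [105.0, 115.0, 113.0],  # points allowed per 100 possessions
})

# Conventional net rating: offense minus defense (assumed to match ADV_NET_RATING).
toy_ratings["NET_CHECK"] = (
    toy_ratings["ADV_OFF_RATING"] - toy_ratings["ADV_DEF_RATING"]
)

# A positive net rating means the team outscores its opponents on average.
toy_ratings["PROFILE"] = toy_ratings["NET_CHECK"].apply(
    lambda net: "outscores opponents" if net > 0 else "is outscored"
)

print(toy_ratings.sort_values("NET_CHECK", ascending=False))
```

Under this reading, a point in the lower right of the scatter plot (high offense, low defensive rating) is the best place to be.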

<h3>Player-level analysis for the Oklahoma City Thunder: who scores the points</h3>

In [None]:
okc_players=players[players['BASE_TEAM_ABBREVIATION']=='OKC']

plt.figure(figsize=(12,8))
sns.barplot(
    data = okc_players.sort_values('BASE_PTS',ascending = False).head(30),
    x="BASE_PTS",
    y="PLAYER_NAME",
    palette="Oranges_r"
)

plt.title("Top scorers: Oklahoma City Thunder")
plt.xlabel('Points per game')
plt.show()

The plot shows that Shai Gilgeous-Alexander is the leading scorer, contributing more than 30 points per game, while Brooks Barnhizer is the lowest scorer, at about 2 points per game.

<h3>Let's analyze context: injuries and referees</h3>

We will now analyze the injury and referee records in the context data.

In [None]:
context["CTX_TYPE"].value_counts()

The context table contains 114 injury records and 121 referee records in total.

<h3>Now, we will analyze information regarding all referees</h3>

In [None]:
refs = context[context["CTX_TYPE"] == "REF"].copy()

ref_cols = [
    "REFEREE",
    "ROLE",
    "EXPERIENCE_(YEARS)",
    "GAMES_OFFICIATED",
    "HOME_TEAM_WIN%",
    "TOTAL_POINTS_PER_GAME",
    "CALLED_FOULS_PER_GAME"
]

refs[ref_cols].head(200)

<h3>Finding the Top referees in all the given characteristics</h3>

In [None]:
most_exp= refs.loc[refs['EXPERIENCE_(YEARS)'].idxmax()]
highest_ppg = refs.loc[refs['TOTAL_POINTS_PER_GAME'].idxmax()]
highest_home_win=refs.loc[refs['HOME_TEAM_WIN%'].idxmax()]
most_fouls = refs.loc[refs['CALLED_FOULS_PER_GAME'].idxmax()]
game_officiated= refs.loc[refs['GAMES_OFFICIATED'].idxmax()]

print("--- Referee Leaders ---")
print(f"Most Experienced: {most_exp['REFEREE']} ({most_exp['EXPERIENCE_(YEARS)']} years)")
print(f"Highest PPG: {highest_ppg['REFEREE']} ({highest_ppg['TOTAL_POINTS_PER_GAME']} pts)")
print(f"Highest Home Win %: {highest_home_win['REFEREE']} ({highest_home_win['HOME_TEAM_WIN%']:.2f}%)")
print(f"Most Fouls Called: {most_fouls['REFEREE']} ({most_fouls['CALLED_FOULS_PER_GAME']} per game)")
print(f"Most Games Officiated: {game_officiated['REFEREE']} ({game_officiated['GAMES_OFFICIATED']} games)")

<h3>Visualizing the relationship between referee experience and home team win %</h3>

In [None]:
plt.figure(figsize=(8, 6))

sns.scatterplot(
    data=refs,
    x="EXPERIENCE_(YEARS)",
    y="HOME_TEAM_WIN%",
    size="GAMES_OFFICIATED",
    sizes=(50, 250),
    alpha=0.7
)

plt.title("Referee Experience vs Home Team Win%")
plt.xlabel("Experience (years)")
plt.ylabel("Home Team Win%")
plt.tight_layout()
plt.show()



A large bubble on the far right, near the 55-60% win line, marks a veteran referee with many games officiated. These are likely the most trusted referees, who have called games consistently over decades.

A bubble high on the y-axis marks a referee in whose games the home team wins noticeably more often.

A bubble low on the y-axis marks a referee who may be less influenced by the home crowd, or who perhaps even over-corrects, leading to more away-team wins.
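The reading of the bubbles above can be backed by a number. The sketch below computes the Pearson correlation between experience and home team win %, first over all referees and then only over those with enough games for the win rate to be meaningful. It uses a toy table with made-up values; the names `toy_refs` and `reliable` and the threshold `MIN_GAMES` are hypothetical and not part of the dataset.

```python
import pandas as pd

# Toy values for illustration only; they are NOT taken from context_master.csv.
toy_refs = pd.DataFrame({
    "EXPERIENCE_(YEARS)": [2, 5, 10, 15, 20, 25],
    "HOME_TEAM_WIN%": [70.0, 40.0, 58.0, 52.0, 56.0, 55.0],
    "GAMES_OFFICIATED": [3, 4, 30, 45, 60, 80],
})

# Pearson correlation over every referee, including small samples.
r_all = toy_refs["EXPERIENCE_(YEARS)"].corr(toy_refs["HOME_TEAM_WIN%"])

# Hypothetical cut-off: a win rate over very few games is mostly noise.
MIN_GAMES = 10
reliable = toy_refs[toy_refs["GAMES_OFFICIATED"] >= MIN_GAMES]
r_reliable = reliable["EXPERIENCE_(YEARS)"].corr(reliable["HOME_TEAM_WIN%"])

print(f"All referees: r = {r_all:.2f} (n = {len(toy_refs)})")
print(f"At least {MIN_GAMES} games: r = {r_reliable:.2f} (n = {len(reliable)})")
```

Extreme bubbles at the top or bottom of the plot often belong to referees with few games, so comparing the two coefficients gives a rough sense of how much of the pattern comes from small samples.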

<h3>Top 15 referees by points per game, with their games officiated</h3>

In [None]:
refs_ppg=(
    refs.sort_values("TOTAL_POINTS_PER_GAME",ascending = False)
    .loc[:,["REFEREE","ROLE","GAMES_OFFICIATED","TOTAL_POINTS_PER_GAME"]]
    .head(15)
    .reset_index(drop= True)
)

refs_ppg

The table suggests that referees with a higher TOTAL_POINTS_PER_GAME tend to have fewer GAMES_OFFICIATED, although the pattern is loose. ROLE shows no consistent relationship with the other variables.
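A loose pattern like this can be checked with a rank correlation instead of reading it off the table. The sketch below computes Spearman's rho by ranking both columns and taking the Pearson correlation of the ranks. The values are made up for illustration, and the names `toy_ppg` and `rho` are hypothetical.

```python
import pandas as pd

# Toy values for illustration only; they are NOT taken from context_master.csv.
toy_ppg = pd.DataFrame({
    "TOTAL_POINTS_PER_GAME": [250.0, 240.0, 235.0, 230.0, 225.0],
    "GAMES_OFFICIATED": [2, 9, 5, 20, 40],
})

# Spearman's rho is the Pearson correlation of the ranked values.
ppg_rank = toy_ppg["TOTAL_POINTS_PER_GAME"].rank()
games_rank = toy_ppg["GAMES_OFFICIATED"].rank()
rho = ppg_rank.corr(games_rank)

# A value near -1 means higher points per game go with fewer games officiated.
print(f"Spearman rho: {rho:.2f}")
```

A strongly negative rho on the real data would support the observation; a value near zero would suggest the pattern in the top 15 is mostly coincidence.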

<h3>Plotting graph for Top 15 REFEREE vs POINTS PER GAME</h3>

In [None]:
plt.figure(figsize=(12,6))

sns.barplot(
    data=refs_ppg,
    x="REFEREE",
    y="TOTAL_POINTS_PER_GAME",
    hue="ROLE",
    dodge=False
)

plt.title("Top 15 Referees by Total Points per Game")
plt.xlabel("Referee")
plt.ylabel("Total Points per Game")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()



Most referees in the top 15 fall within a similar range of total points per game, with Dedric Taylor the highest at more than 250 points per game.

<h3>We successfully analyzed the relationships between different variables and found meaningful information about NBA teams, context and referees</h3>

<h1>===Here we conclude our analysis of the notebook===</h1>