# Segmented Performance Analysis

After the general EDA, I didn't notice much difference between the correlations. Now we are going to get more complicated. I am going to place each player into one of three groups (Superstars, Secondary Stars, and Role Players). I am then going to run statistical tests on those groups and compare the results between wins and losses.

First, we are going to add per-36 columns, which scale each stat to a 36-minute rate so that players with different minutes played can be compared more fairly.
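
To make the per-36 idea concrete, here is a minimal sketch of the calculation on its own. The helper name `per36` and the numbers are made up for illustration and are not part of the dataset; returning NaN for zero minutes is one possible choice, not something the notebook requires.

```python
def per36(stat_total, minutes):
    """Scale a raw stat to a 36-minute rate (hypothetical helper for illustration)."""
    if minutes == 0:
        # No playing time means the rate is undefined, so return NaN
        return float("nan")
    return stat_total / minutes * 36


# Made-up example: two players who score at the same rate in different minutes
starter = per36(18, 24)  # 18 points in 24 minutes
bench = per36(9, 12)     # 9 points in 12 minutes
print(starter, bench)
```

Both players end up with the same per-36 value even though their raw totals differ, which is the kind of comparison the normalization is meant to allow.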

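We will then separate players into three tiers based on who the NBA considers a Superstar, a Secondary Star, or a Role Player. The first two tiers are hard-coded name lists, and everyone else is treated as a Role Player.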

In [1]:
import pandas as pd

# Load the wins and losses datasets
wins_df = pd.read_csv("../data/NBATrackingDataWins24-25_clean.csv")
losses_df = pd.read_csv("../data/NBATrackingDataLosses24-25_clean.csv")

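# Add a per-36 version of every numeric column except minutes.
# Note: this also rescales any rate or percentage columns, where a per-36 value is not meaningful.
for df in [wins_df, losses_df]:
    if "MIN" in df.columns:
        numeric_cols = df.select_dtypes("number").columns.drop("MIN")
        # Treat zero minutes as missing so the division yields NaN rather than inf
        minutes = df["MIN"].replace(0, float("nan"))
        for col in numeric_cols:
            df[f"{col}_Per36"] = df[col] / minutes * 36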

# Save the updated datasets
wins_df.to_csv("../data/NBATrackingDataWins24-25_per36.csv", index=False)
losses_df.to_csv("../data/NBATrackingDataLosses24-25_per36.csv", index=False)

# Begin to Categorize Players into Specific Groups

We will also remove the accents from player names so that they match the unaccented spellings used in the tier lists (for example, "Dončić" becomes "Doncic").

In [18]:
import unicodedata

wins_df = pd.read_csv("../data/NBATrackingDataWins24-25_per36.csv")
losses_df = pd.read_csv("../data/NBATrackingDataLosses24-25_per36.csv")

# Define Superstar and Secondary Star lists
superstars = [
    "Luka Doncic", "Giannis Antetokounmpo", "Nikola Jokic",
    "Jayson Tatum", "Shai Gilgeous-Alexander", "Stephen Curry",
    "LeBron James", "Kevin Durant", "Joel Embiid", "Anthony Davis",
    "Anthony Edwards", "Jalen Brunson", "Victor Wembanyama",
    "Devin Booker", "Kawhi Leonard", "Trae Young", "Donovan Mitchell"
]

secondary_stars = [
    "Jaylen Brown", "Jimmy Butler III", "De'Aaron Fox",
    "Kyrie Irving", "Ja Morant", "Zion Williamson", "Bam Adebayo",
    "Pascal Siakam", "Brandon Ingram", "Domantas Sabonis", "Karl-Anthony Towns",
    "Jamal Murray", "Darius Garland", "Damian Lillard", "Jalen Williams",
    "Paul George", "Scottie Barnes", "Paolo Banchero", "Tyrese Haliburton",
    "Tyrese Maxey", "Lauri Markkanen", "Mikal Bridges", "Franz Wagner",
    "Desmond Bane", "Chet Holmgren", "Kristaps Porzingis", "Jrue Holiday",
    "Dejounte Murray"
]

# Function to strip accents
def remove_accents(name):
    if isinstance(name, str):
        return ''.join(
            c for c in unicodedata.normalize('NFD', name)
            if unicodedata.category(c) != 'Mn'
        )
    return name

# Map a player name to its tier; anyone not on either list is a Role Player
def categorize_by_name(player):
    if player in superstars:
        return "Superstar"
    elif player in secondary_stars:
        return "Secondary Star"
    else:
        return "Role Player"

# Strip accents first so names match the lists, then assign tiers in both datasets
wins_df["PLAYER"] = wins_df["PLAYER"].apply(remove_accents)
losses_df["PLAYER"] = losses_df["PLAYER"].apply(remove_accents)

wins_df["Tier"] = wins_df["PLAYER"].apply(categorize_by_name)
losses_df["Tier"] = losses_df["PLAYER"].apply(categorize_by_name)

wins_df["Tier"].value_counts()

Tier
Role Player       502
Secondary Star     28
Superstar          17
Name: count, dtype: int64
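
The 17 Superstars and 28 Secondary Stars in this output equal the lengths of the two lists, which is consistent with every listed name matching a row if each player appears once. Because any misspelled or differently formatted name would silently fall into the Role Player tier, a quick check for unmatched names can be useful. The sketch below is only an illustration: `strip_accents` mirrors `remove_accents` above, `unmatched_names` is a hypothetical helper, and the sample names are invented inputs rather than rows from the dataset.

```python
import unicodedata


def strip_accents(name):
    """Same idea as remove_accents above: drop combining marks after NFD."""
    return "".join(
        c for c in unicodedata.normalize("NFD", name)
        if unicodedata.category(c) != "Mn"
    )


def unmatched_names(tier_list, dataset_names):
    """Return tier-list names that do not appear in the (accent-stripped) data."""
    cleaned = {strip_accents(n) for n in dataset_names}
    return sorted(n for n in tier_list if n not in cleaned)


# Invented sample inputs for illustration only
sample_dataset_names = ["Luka Dončić", "Nikola Jokić", "Jimmy Butler III"]
sample_tier_list = ["Luka Doncic", "Nikola Jokic", "Jimmy Butler"]

missing = unmatched_names(sample_tier_list, sample_dataset_names)
print(missing)
```

In the notebook, the same helper could be called with `superstars + secondary_stars` and `wins_df["PLAYER"]`; an empty result would mean every listed player was found.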

# Start Conducting Statistical Tests Between the Tiers 

In [26]:
from scipy import stats

# Per-36 metrics to compare across the three tiers
metrics = ["DRIVES_Per36", "PTS_Per36", "AST_Per36", "PASS_Per36", "TO_Per36", "FTA_Per36", "PF_Per36"]
tiers = ["Superstar", "Secondary Star", "Role Player"]

# One-way ANOVA per metric: tests whether the mean differs across the tiers
for df, label in [(wins_df, "Wins"), (losses_df, "Losses")]:
    print(f"\n=== {label} ===")
    for metric in metrics:
        # Drop missing values, which would otherwise make f_oneway return NaN
        groups = [df.loc[df["Tier"] == tier, metric].dropna() for tier in tiers]
        f_stat, p_val = stats.f_oneway(*groups)
        print(f"{metric}: F = {f_stat:.2f}, p = {p_val:.4f}")


=== Wins ===
DRIVES_Per36: F = 34.45, p = 0.0000
PTS_Per36: F = 66.83, p = 0.0000
AST_Per36: F = 20.85, p = 0.0000
PASS_Per36: F = 11.40, p = 0.0000
TO_Per36: F = 3.71, p = 0.0251
FTA_Per36: F = 44.56, p = 0.0000
PF_Per36: F = 44.23, p = 0.0000

=== Losses ===
DRIVES_Per36: F = 32.08, p = 0.0000
PTS_Per36: F = 53.44, p = 0.0000
AST_Per36: F = 3.53, p = 0.0301
PASS_Per36: F = 10.93, p = 0.0000
TO_Per36: F = 7.64, p = 0.0005
FTA_Per36: F = 49.59, p = 0.0000
PF_Per36: F = 48.76, p = 0.0000
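
With 502 Role Players in the wins data, even small differences between tiers can reach significance, so it may help to look at an effect size alongside the p-values. The sketch below converts the reported wins F statistics into eta squared using the identity η² = F·df_between / (F·df_between + df_within). It assumes no rows were dropped before the ANOVA, so the wins test has 547 observations (2 and 544 degrees of freedom); `eta_squared` is a helper name introduced here, not something from SciPy.

```python
def eta_squared(f_stat, df_between, df_within):
    """Share of total variance explained by the grouping, derived from F."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


# Degrees of freedom for the wins ANOVA, assuming all 547 rows were used
n_wins = 502 + 28 + 17
df_between = 3 - 1          # three tiers
df_within = n_wins - 3

# F statistics copied from the wins output above
wins_f = {
    "DRIVES_Per36": 34.45,
    "PTS_Per36": 66.83,
    "AST_Per36": 20.85,
    "PASS_Per36": 11.40,
    "TO_Per36": 3.71,
    "FTA_Per36": 44.56,
    "PF_Per36": 44.23,
}

wins_eta = {m: eta_squared(f, df_between, df_within) for m, f in wins_f.items()}
for metric, value in wins_eta.items():
    print(f"{metric}: eta^2 = {value:.3f}")
```

The losses table could be handled the same way once the number of rows in `losses_df` is known; that count is not shown above, so it is left out here.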


# Interpretation 