In [56]:
!pip install lets_plot



In [57]:
# from google.colab import drive
import pandas as pd
from lets_plot import *
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from scipy.stats import bootstrap

In [58]:
# drive.mount('/content/drive')
# pd.set_option('display.max_columns', None)



# Do MLB Star Players Underperform in the Postseason?

The MLB postseason is renowned for superstars delivering iconic moments and solidifying their legacies. In the 2024 postseason, however, Aaron Judge, one of baseball’s biggest stars, performed uncharacteristically poorly. Coming off a regular season in which he led the league in nearly every offensive category and won his second MVP in three seasons, Judge posted a .752 OPS, a significant drop from his regular-season OPS of 1.159. This sparked my interest in whether "star" players consistently decline in the playoffs relative to the regular season. I hypothesize that star batters tend to underperform in the postseason, and that understanding the factors behind this trend can provide insight into player consistency under pressure and the dynamics of playoff performance.


## Dataset and Methodology

- **Data Source**: [FanGraphs](https://www.fangraphs.com) (2010-2024 seasons, excluding 2020).
- **Star Player Definition**:
    - Batters: average regular-season WAR > 3.75 across the seasons in the dataset (the threshold applied in the code).
- **Metrics**:
    - Batters: OBP, SLG, pLI


The statistics we analyze for batters are slugging percentage (SLG) and on-base percentage (OBP). SLG measures a batter's power as total bases per at-bat, so extra-base hits (doubles, triples, home runs) count more heavily than singles. This makes it an important metric for understanding a player’s ability to drive in runs and produce impactful hits. OBP measures how often a batter reaches base through hits, walks, and hit-by-pitches, capturing their effectiveness at avoiding outs and keeping innings alive. Together, SLG and OBP give a fuller picture of a batter’s offensive production because they balance power against the ability to reach base consistently. We also use the player leverage index (pLI) to compare how batters perform under differing levels of pressure in the regular season and the postseason.
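The two rate statistics described above can be computed directly from counting stats. The sketch below uses the standard definitions (total bases per at-bat for SLG; times on base over at-bats plus walks, hit-by-pitches, and sacrifice flies for OBP). The `BattingLine` container and the stat line are hypothetical and are not taken from the FanGraphs files used in this notebook.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Counting stats for one hitter (hypothetical container, not a FanGraphs schema)."""
    at_bats: int
    singles: int
    doubles: int
    triples: int
    home_runs: int
    walks: int
    hit_by_pitch: int
    sacrifice_flies: int

    @property
    def hits(self) -> int:
        return self.singles + self.doubles + self.triples + self.home_runs

    def slg(self) -> float:
        # Total bases per at-bat: each hit is weighted by the bases it earns.
        total_bases = (self.singles + 2 * self.doubles
                       + 3 * self.triples + 4 * self.home_runs)
        return total_bases / self.at_bats

    def obp(self) -> float:
        # Times on base divided by the plate appearances that count toward OBP.
        times_on_base = self.hits + self.walks + self.hit_by_pitch
        opportunities = (self.at_bats + self.walks
                         + self.hit_by_pitch + self.sacrifice_flies)
        return times_on_base / opportunities


# Made-up stat line, for illustration only.
line = BattingLine(at_bats=500, singles=90, doubles=30, triples=2, home_runs=28,
                   walks=60, hit_by_pitch=5, sacrifice_flies=5)
print(f"SLG = {line.slg():.3f}, OBP = {line.obp():.3f}")  # SLG = 0.536, OBP = 0.377
```

Note that walks raise OBP but leave SLG untouched, which is why the two metrics can move independently.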

In [59]:
# Define the base paths for data relative to the repo
regular_szn_data_directory = "./data/Regular_Szn/"
postseason_data_directory = "./data/Postseason/"
seasons = [2024, 2023, 2022, 2021, 2019, 2018, 2017, 2016, 2015, 2014, 2013, 2012, 2011, 2010]

# Generalized function to load and merge season data
def load_and_merge_season_data(season, data_directory, suffix):
    """
    Load and merge season data for a given season.

    Parameters:
    - season: The year of the season (e.g., 2024).
    - data_directory: The base path for the data (e.g., regular or postseason).
    - suffix: The suffix for the file naming convention ('_rs' for regular season, '_ps' for postseason).

    Returns:
    - merged_season_data: The merged DataFrame for the given season.
    """
    main_stats_file = f"{data_directory}fangraphs-leaderboards_{season}{suffix}.csv"
    win_stats_file = f"{data_directory}fangraphs-leaderboards_{season}_win{suffix}.csv"

    # Load datasets
    main_stats_data = pd.read_csv(main_stats_file)
    win_stats_data = pd.read_csv(win_stats_file)

    # Merge datasets on the 'Name' column
    merged_season_data = main_stats_data.merge(win_stats_data, on='Name')
    return merged_season_data

# Load and merge regular season data
regular_season_dataframes = [
    load_and_merge_season_data(season, regular_szn_data_directory, '_rs') for season in seasons
]

# Load and merge postseason data
postseason_dataframes = [
    load_and_merge_season_data(season, postseason_data_directory, '_ps') for season in seasons
]

# Concatenate all regular season and postseason DataFrames
combined_regular_season_df = pd.concat(regular_season_dataframes, ignore_index=True)
combined_postseason_df = pd.concat(postseason_dataframes, ignore_index=True)

# Aggregate by 'Name' and compute averages for numeric columns
aggregated_regular_season_df = combined_regular_season_df.groupby('Name').mean(numeric_only=True).reset_index()
aggregated_postseason_df = combined_postseason_df.groupby('Name').mean(numeric_only=True).reset_index()

# Select star players by average regular-season WAR, then keep their postseason rows
star_players_regular_szn = aggregated_regular_season_df.query("WAR > 3.75")
star_players_postseason = aggregated_postseason_df[aggregated_postseason_df['Name'].isin(star_players_regular_szn['Name'])].reset_index(drop=True)
# star_players_postseason
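
Merging on `Name` with the default inner join silently drops any player who appears in only one of the two files, and a repeated name would multiply rows. One way to surface both cases is sketched below using the `validate` and `indicator` options of `DataFrame.merge`. The two small frames are hypothetical stand-ins for one season's exports, not real FanGraphs data.

```python
import pandas as pd

# Hypothetical stand-ins for one season's two leaderboard exports.
main_stats = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "OBP": [0.390, 0.350, 0.310],
    "SLG": [0.560, 0.480, 0.400],
})
win_stats = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player D"],
    "pLI": [1.05, 0.95, 1.10],
})

merged = main_stats.merge(
    win_stats,
    on="Name",
    how="outer",            # keep unmatched rows so they can be inspected
    validate="one_to_one",  # raises pandas.errors.MergeError if a name repeats
    indicator=True,         # adds a '_merge' column: left_only / right_only / both
)

# Players present in only one file.
unmatched = merged[merged["_merge"] != "both"]
# Rows an inner merge would have kept.
matched = merged[merged["_merge"] == "both"].drop(columns="_merge")

print(unmatched[["Name", "_merge"]])
print(matched)
```

If `unmatched` is non-empty, it is worth checking whether the missing players matter for the comparison before falling back to an inner join.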


## Batters Comparison Regular Season vs Postseason

#### Regular Season & Postseason Star Players Bootstrapping

In the postseason, the smaller sample size introduces greater variability in players' statistics, making traditional methods for estimating averages less reliable. Because postseason statistics are central to this project, we use bootstrapping to compensate: the data are resampled repeatedly, with replacement, to build a distribution of possible means. From that distribution we obtain a confidence interval around the mean slugging percentage and on-base percentage for each season type. This keeps our regular-season versus postseason comparison from being overly influenced by the variability of the much smaller postseason sample.
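The resampling idea described above can be sketched by hand with NumPy. This is an illustrative sketch rather than SciPy's implementation: the function name `basic_bootstrap_ci`, the fixed seed, and the sample OBP values are all hypothetical. It computes the "basic" (reverse percentile) interval, the same family of interval requested with `method='basic'` in `scipy.stats.bootstrap`.

```python
import numpy as np


def basic_bootstrap_ci(values, confidence_level=0.95, n_resamples=10_000, seed=0):
    """Return the sample mean and a basic (reverse percentile) bootstrap CI."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    observed_mean = values.mean()

    # Each row of idx is one resample: len(values) draws with replacement.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    resampled_means = values[idx].mean(axis=1)

    # Quantiles of the bootstrap distribution of the mean.
    alpha = 1.0 - confidence_level
    q_low, q_high = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])

    # Basic interval: reflect the bootstrap quantiles around the observed mean.
    return observed_mean, (2 * observed_mean - q_high, 2 * observed_mean - q_low)


# Made-up postseason OBP values for eight players, for illustration only.
sample_obp = [0.250, 0.310, 0.280, 0.333, 0.190, 0.400, 0.275, 0.300]
mean_obp, (ci_low, ci_high) = basic_bootstrap_ci(sample_obp)
print(f"mean = {mean_obp:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

With only a handful of observations the interval is wide, which is the behavior the paragraph above describes for postseason samples.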

In [74]:
# Define a function to perform bootstrapping
def bootstrap_mean(data, confidence_level=0.95, n_resamples=10000):
    """
    Perform bootstrapping to calculate the mean and confidence interval.

    Parameters:
    - data: The data array for bootstrapping.
    - confidence_level: The desired confidence level for the interval (default: 95%).
    - n_resamples: Number of resamples to perform (default: 10,000).

    Returns:
    - mean_value: Mean of the original data.
    - confidence_interval: Confidence interval from the bootstrap.
    """
    result = bootstrap(
        data=(data,),                    # Data to bootstrap
        statistic=np.mean,               # Statistic to compute (mean)
        confidence_level=confidence_level, # Confidence level
        n_resamples=n_resamples,         # Number of resamples
        method='basic'                   # Bootstrap method
    )
    return np.mean(data), result.confidence_interval

# Apply the function to different datasets
mean_obp_regszn, ci_obp_regszn = bootstrap_mean(star_players_regular_szn['OBP'])
mean_slg_regszn, ci_slg_regszn = bootstrap_mean(star_players_regular_szn['SLG'])
mean_obp_postszn, ci_obp_postszn = bootstrap_mean(star_players_postseason['OBP'])
mean_slg_postszn, ci_slg_postszn = bootstrap_mean(star_players_postseason['SLG'])


#### Calculating Statistical Values

In [75]:
# Calculate the mean values
average_stats_reg = star_players_regular_szn[['OBP', "SLG"]].mean().reset_index()

average_stats_reg.columns = ["Statistic", "Average Stats of Stars Regular Season"]

In [76]:
# Calculate the mean values
average_stats_postseason = star_players_postseason[['OBP', "SLG"]].mean().reset_index()

average_stats_postseason.columns = ["Statistic", "Average Stats of Stars Postseason"]

### Regular Season Star Players Visualization Code

In [77]:
# Data for the mean point
mean_point_data = {'x': [mean_slg_regszn], 'y': [mean_obp_regszn], 'type': ['Mean Point']}


# Plot
regular_season_batters = ggplot(star_players_regular_szn) + \
    geom_point(aes(x='SLG', y='OBP'), color='black') + \
    geom_vline(xintercept=ci_slg_regszn.low, color='blue', linetype='dashed') + \
    geom_vline(xintercept=ci_slg_regszn.high, color='blue', linetype='dashed') + \
    geom_hline(yintercept=ci_obp_regszn.low, color='blue', linetype='dashed') + \
    geom_hline(yintercept=ci_obp_regszn.high, color='blue', linetype='dashed') + \
    geom_point(data=mean_point_data, mapping=aes(x='x', y='y', color='type'), size=4) + \
    scale_x_continuous(limits=[0, 1]) + \
    scale_y_continuous(limits=[0, 0.6]) + \
    ggtitle("Slugging (SLG) vs On Base Percentage (OBP) Regular Season") + \
    ggsize(800, 600) + \
    xlab("SLG") + \
    ylab("OBP") + \
    scale_color_manual(values={'Confidence Interval': 'blue', 'Mean Point': 'red'}, name='Legend')

# regular_season_batters

### Postseason Star Players Visualization Code

In [78]:
# Data for the mean point
mean_point_data = {'x': [mean_slg_postszn], 'y': [mean_obp_postszn], 'type': ['Mean Point']}


# Plot
postseason_batters = ggplot(star_players_postseason) + \
    geom_point(aes(x='SLG', y='OBP'), color='black') + \
    geom_vline(xintercept=ci_slg_postszn.low, color='blue', linetype='dashed') + \
    geom_vline(xintercept=ci_slg_postszn.high, color='blue', linetype='dashed') + \
    geom_hline(yintercept=ci_obp_postszn.low, color='blue', linetype='dashed') + \
    geom_hline(yintercept=ci_obp_postszn.high, color='blue', linetype='dashed') + \
    geom_point(data=mean_point_data, mapping=aes(x='x', y='y', color='type'), size=4) + \
    scale_x_continuous(limits=[0, 1]) + \
    scale_y_continuous(limits=[0, 0.6]) + \
    ggtitle("Slugging (SLG) vs On Base Percentage (OBP) Postseason") + \
    ggsize(800, 600) + \
    xlab("SLG") + \
    ylab("OBP") + \
    scale_color_manual(values={'Confidence Interval': 'blue', 'Mean Point': 'red'}, name='Legend')

# postseason_batters

### Star Batters Plots Postseason vs Regular Season



In [79]:
regular_season_batters

In [80]:
average_stats_reg

| | Statistic | Average Stats of Stars Regular Season |
|---|---|---|
| 0 | OBP | 0.353168 |
| 1 | SLG | 0.475925 |


In [81]:
postseason_batters

In [82]:
average_stats_postseason

| | Statistic | Average Stats of Stars Postseason |
|---|---|---|
| 0 | OBP | 0.288572 |
| 1 | SLG | 0.349703 |


### **Batters Analysis:**

These plots include 95% confidence intervals (dashed blue lines) estimating the range in which the true means for SLG and OBP lie, with red dots marking the observed means. The regular-season intervals sit at higher values, reflecting larger means for both metrics (SLG: 0.476, OBP: 0.353) than in the postseason (SLG: 0.350, OBP: 0.289). Regular-season data is also more tightly clustered, indicating greater consistency, while postseason data shows wider variability and reduced offensive production, suggesting that star batters tend to underperform in the postseason.

I hypothesize that the drop in performance is influenced by the higher quality of pitching faced in the postseason. During the regular season, the average ERA of pitchers faced by star players is 4.104, compared to 3.749 in the postseason. This indicates that star players face weaker competition in the regular season, which likely provides more opportunities for offensive success. Elite postseason pitchers are more effective at limiting hits, walks, and extra-base hits, which likely contributes to the observed decline in OBP and SLG. This tougher competition is a plausible reason why star players struggle to maintain their regular-season production in the postseason.
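Because the two groups above do not contain exactly the same players, a complementary check is to compare each player with himself. The sketch below pairs regular-season and postseason OBP by name and bootstraps the mean per-player difference with a percentile interval. All names and values are hypothetical, and this paired approach is an assumption of the sketch, not something the notebook itself computes.

```python
import numpy as np
import pandas as pd

# Hypothetical per-player averages (not real FanGraphs values).
regular = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E", "F"],
    "OBP": [0.390, 0.360, 0.345, 0.370, 0.335, 0.350],
})
postseason = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],  # player F never reached the playoffs
    "OBP": [0.300, 0.340, 0.250, 0.330, 0.310],
})

# Keep only players present in both tables, then take per-player differences.
paired = regular.merge(postseason, on="Name", suffixes=("_reg", "_post"))
diffs = (paired["OBP_post"] - paired["OBP_reg"]).to_numpy()

# Percentile bootstrap of the mean difference (seed fixed for repeatability).
rng = np.random.default_rng(42)
idx = rng.integers(0, len(diffs), size=(10_000, len(diffs)))
boot_means = diffs[idx].mean(axis=1)
ci_low, ci_high = np.quantile(boot_means, [0.025, 0.975])

print(f"mean difference = {diffs.mean():.3f}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

If the whole interval lies below zero, the decline holds within players and is not just a result of which stars happen to reach the playoffs.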

### Clutch Analysis: OBP vs. pLI

In this section, we analyze OBP by leverage level to determine whether star players perform differently under pressure, especially in the postseason. The Player Leverage Index (pLI) quantifies game situations, with higher values indicating critical, high-pressure moments. For this analysis, leverage is split into two levels: low (pLI ≤ 1.0, routine moments) and high (pLI > 1.0, key, high-pressure moments). Postseason games often feature a higher proportion of high-leverage situations than the regular season, increasing the pressure on players to deliver in pivotal moments.
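The low/high split described above maps onto `pandas.cut`, whose bins are closed on the right by default. In this small sketch the pLI values are hypothetical, and using `np.inf` as the open upper edge is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical pLI values, including one exactly on the 1.0 boundary.
pli = pd.Series([0.85, 1.00, 1.01, 1.40], name="pLI")

# Bins are (0, 1.0] and (1.0, inf], so a pLI of exactly 1.0 counts as Low.
levels = pd.cut(pli, bins=[0, 1.0, np.inf], labels=["Low", "High"])

print(levels.tolist())  # ['Low', 'Low', 'High', 'High']
```

Values at or below the lowest edge (here, 0) fall outside every bin and come back as `NaN`, so they would drop out of a later `groupby`.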



In [72]:
# Add Season column to both DataFrames
star_players_regular_szn = star_players_regular_szn.copy()
star_players_regular_szn.loc[:, 'Season'] = 'Regular'

star_players_postseason = star_players_postseason.copy()
star_players_postseason.loc[:, 'Season'] = 'Postseason'

combined_df = pd.concat([star_players_regular_szn, star_players_postseason], ignore_index=True)

# Define bins for only Low and High leverage situations
combined_df['pLI_level'] = pd.cut(
    combined_df['pLI'],
    bins=[0, 1.0, combined_df['pLI'].max()],  # Two bins: <= 1.0 (Low) and > 1.0 (High)
    labels=['Low', 'High']
)

# Group by season and pLI levels and calculate mean OBP
stats_by_pli = combined_df.groupby(['Season', 'pLI_level'], observed=True)[['OBP']].mean().reset_index()

# Plot
ggplot(stats_by_pli) + \
    geom_bar(aes(x='pLI_level', y='OBP', fill='Season'), stat='identity', position='dodge') + \
    ggtitle('OBP by Leverage Level: Regular vs. Postseason') + \
    scale_fill_manual(values={'Regular': 'red', 'Postseason': 'blue'}) + \
    ggsize(800, 600) + xlab("pLI Level")


**Analysis**:

The OBP-by-leverage plot shows that star players tend to struggle more in high-pressure postseason moments than in the regular season. In the regular season, OBP is consistent across low and high leverage levels, indicating that star players handle both situations with relative stability. In the postseason, however, OBP is noticeably lower at both leverage levels, with the largest decline in high-leverage moments. This suggests that the postseason environment, with its heightened stakes and frequent critical situations, adds pressure on star players and contributes to a decline in their offensive performance.

### Conclusion

Our analysis of FanGraphs data reveals a clear trend: star MLB batters tend to underperform in the postseason relative to their regular-season numbers. Key statistics like SLG and OBP show measurable declines during the playoffs, supporting our hypothesis. Factors such as tougher competition and the unique pressures of postseason baseball likely contribute to this drop in performance. These findings underscore the importance of mental resilience and adaptability for players moving from regular-season dominance to playoff success, offering valuable insights for teams, analysts, and fans.