
# Analysis of baseball team payrolls

# Introduction
Pay to Win: MLB’s Need For a Salary Cap
In 2025, the Los Angeles Dodgers have a projected payroll of around 332 million dollars, while the Miami Marlins have a payroll of about 68 million dollars, nearly a fivefold difference.

In the 2024 offseason, the Dodgers handed out 1.4 billion dollars in contracts, and the New York Mets followed with their own billion-dollar offseason in 2025, reigniting the debate over whether MLB should enforce a salary cap limiting how much a team can spend on its roster each season. Major League Baseball has been around for almost 150 years and is the only one of the four major American leagues (NBA, NFL, NHL, and MLB) that places no hard limit on a team’s payroll. There is a luxury tax to penalize excessive spending, but teams like the Dodgers and the New York Mets, who had tax bills of 103 million dollars and 97 million dollars respectively, didn’t let that stop them from assembling their super teams. Last season, the Dodgers went on to win the World Series, beating the Mets in the National League Championship Series and the New York Yankees (second in MLB payroll in 2024) in the World Series.

With this history of success, can we show that a higher payroll correlates with regular season and postseason success? Can we predict this season’s outcome based on the 2025 payroll?

Our team predicts a direct relationship between a team’s payroll and its season outcome: the highest-payroll teams will have the best records, and the lowest-payroll teams will have the worst records. We also predict that one of this year’s top 5 teams by payroll (the Dodgers, the Mets, the Yankees, the Toronto Blue Jays, or the Philadelphia Phillies) will finish with the best record in baseball.



## Analysis of Payroll on Team Performance

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt


# Load the CSV file
df = pd.read_csv("/content/newmlb_run_differentials_2014_2024.csv")
dr = pd.read_csv("/content/newpayroll.csv")

# Select columns for X and y
y = df['2019']        # y is 1D: the '2019' column of the run differential file
X = dr[['Payroll']]   # X must be 2D, hence the double brackets

# Create and fit the model
model = LinearRegression()
model.fit(X, y)

# Print results
print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
#print("R² Score:", model.score(X, y))

plt.scatter(X, y, color='blue', label='Data points')

# Plot regression line
plt.plot(X, model.predict(X), color='red', label='Regression line')

# Labels and legend
plt.ylabel('Run Differential')
plt.xlabel('Payroll (hundreds of millions ($))')
plt.title('Linear Regression: Payroll vs Run Differential')
#plt.legend()

# Show plot
plt.show()


Run differential (runs scored minus runs allowed) is a simple team stat that combines the contributions of pitching, defense, and offense. It generally tracks team performance: teams with a positive run differential tend to finish above .500, and vice versa. The regression shows that run differential increases with payroll. The fitted slope of 6.9 means that each additional unit of payroll on the x-axis is associated with about 6.9 more runs of run differential. Teams that spend more typically have a higher run differential, which typically leads to better records.
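The slope alone does not say how tightly the points cluster around the line. As a minimal sketch, the cell below fits the same kind of model on synthetic data and also reports R² and the correlation coefficient. The numbers are made up for illustration (they are not MLB figures), and the variable names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (NOT real MLB figures): 30 hypothetical teams.
rng = np.random.default_rng(0)
payroll_musd = rng.uniform(60, 330, size=30)   # payroll in millions of dollars
run_diff = 0.8 * (payroll_musd - 150) + rng.normal(0, 80, size=30)

X_demo = payroll_musd.reshape(-1, 1)   # scikit-learn expects 2D features
y_demo = run_diff                      # 1D target

demo_model = LinearRegression().fit(X_demo, y_demo)
slope = demo_model.coef_[0]
r_squared = demo_model.score(X_demo, y_demo)   # share of variance explained
r = np.corrcoef(payroll_musd, run_diff)[0, 1]  # Pearson correlation

print(f"Slope: {slope:.3f} runs per $1M of payroll")
print(f"R^2: {r_squared:.3f} (r = {r:.3f})")
```

For a one-variable regression with an intercept, R² equals the square of the correlation coefficient, so either number can be reported alongside the slope.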

## Analysis of Payrolls and Postseason Success

Import necessary packages

In [None]:
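import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import sklearn.linear_model  # makes sklearn.linear_model.* available below
from sklearn.linear_model import LinearRegression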

Import fresh dataframes and clean up data

In [None]:
df = pd.read_csv("payroll.csv")
playoffs_df = pd.read_csv('playoffs2.csv')
df['Team'] = df['Team'].str.strip()
# Strip "$" and "," before converting to float (raw string avoids an invalid-escape warning)
df['Active 26-Man'] = df['Active 26-Man'].replace(r'[\$,]', '', regex=True).astype(float)


Standardize the "Active 26-Man" and "Payroll" columns in df within each year: subtract that year's mean and divide by that year's standard deviation. This should mitigate inflationary effects.

In [None]:
df_list = []
for i in range(7):  # seasons 2025 back through 2019
    # .copy() avoids pandas' SettingWithCopyWarning on the assignments below
    temp = df[df['Year'] == 2025 - i].copy()
    # Convert each payroll column to z-scores within the season
    for col in ['Active 26-Man', 'Payroll']:
        temp[col] = (temp[col] - temp[col].mean()) / temp[col].std()
    df_list.append(temp)
df = pd.concat(df_list, ignore_index=True)
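The per-year standardization can also be written without an explicit loop. The sketch below uses `groupby(...).transform(...)` on a made-up miniature table; the team names and values are hypothetical, and the helper `zscore` is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical miniature payroll table; values are made up for illustration.
toy = pd.DataFrame({
    "Year": [2023, 2023, 2023, 2024, 2024, 2024],
    "Team": ["A", "B", "C", "A", "B", "C"],
    "Payroll": [100.0, 150.0, 200.0, 120.0, 180.0, 240.0],
})

def zscore(s: pd.Series) -> pd.Series:
    # Uses the sample standard deviation (ddof=1), pandas' default for .std()
    return (s - s.mean()) / s.std()

# transform() applies zscore within each year and keeps the original row order
toy["Payroll_z"] = toy.groupby("Year")["Payroll"].transform(zscore)
print(toy)
```

After this step each year has mean 0 and standard deviation 1, so a team's value reads as "how many standard deviations above or below that season's average payroll".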


Remove years 2025 and 2020 from both dataframes. We removed 2025 because playoffs have not happened yet and 2020 due to the modified playoffs because of COVID-19.

In [None]:
# .copy() so that later column assignments do not trigger SettingWithCopyWarning
df = df[~df['Year'].isin([2025, 2020])].copy()
playoffs_df = playoffs_df[~playoffs_df['Year'].isin([2025, 2020])].copy()

Create a "Year_Team" Column in both dataframes that we can merge on later.

In [None]:
df['Year_Team'] = df['Year'].astype(str) + " " + df['Team']
playoffs_df['Year_Team'] = playoffs_df['Year'].astype(str) + " " + playoffs_df['Team']

Merge the dataframes on "Year_Team". The row count stays the same as long as every team-season appears exactly once in both dataframes.

In [None]:
df = df.merge(playoffs_df, on='Year_Team')
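An inner merge silently drops rows whose key is missing on either side, so it can be worth checking the row count. The sketch below shows one way to do that with the `validate` and `indicator` options of `DataFrame.merge`; the two small tables are hypothetical and only mimic the "Year_Team" key used above.

```python
import pandas as pd

# Hypothetical toy tables that share a "Year_Team" key.
payroll_toy = pd.DataFrame({
    "Year_Team": ["2023 A", "2023 B", "2024 A"],
    "Payroll": [0.5, -0.5, 1.2],
})
playoffs_toy = pd.DataFrame({
    "Year_Team": ["2023 A", "2023 B", "2024 C"],  # "2024 C" has no payroll row
    "Made Playoffs": ["Yes", "No", "Yes"],
})

# An outer merge with indicator=True labels where each row came from;
# validate="one_to_one" raises if a key is duplicated on either side.
checked = payroll_toy.merge(
    playoffs_toy, on="Year_Team", how="outer",
    validate="one_to_one", indicator=True,
)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Year_Team", "_merge"]])

# The default inner merge keeps only keys present in both tables.
inner = payroll_toy.merge(playoffs_toy, on="Year_Team", validate="one_to_one")
print(len(payroll_toy), "->", len(inner))
```

If `unmatched` is empty, the inner merge keeps every row. Otherwise the listed keys usually point to spelling or whitespace differences in team names.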

Replace "No" and "Yes" with 0 and 1 in "Made Playoffs" and "Won World Series" columns, which will be used for the logistic regression.

In [None]:
# map() converts the labels to integers; any unexpected label becomes NaN
df['Made Playoffs'] = df['Made Playoffs'].map({'No': 0, 'Yes': 1})
df['Won World Series'] = df['Won World Series'].map({'No': 0, 'Yes': 1})

Make Linear Regression model for "Active 26-Man" and "win%". Then, print and store intercept and slope.

In [None]:
X = df[['Active 26-Man']].values
y = df['win%'].values

linear_model = LinearRegression()
linear_model.fit(X, y)

print(f"Coefficient: {linear_model.coef_[0]}")
print(f"Intercept: {linear_model.intercept_}")
coef = linear_model.coef_[0]
intercept = linear_model.intercept_

Fit the same kind of model for "Payroll" and "win%".

In [None]:
X = df[['Payroll']].values  # Independent variable
y = df['win%'].values  # Dependent variable

# Create and fit the model
linear_model2 = LinearRegression()
linear_model2.fit(X, y)

# Print the coefficients
print(f"Coefficient: {linear_model2.coef_[0]}")
print(f"Intercept: {linear_model2.intercept_}")

Both models use standard units for x and share the same y values, so each slope is proportional to the correlation coefficient: a larger slope means a stronger correlation. "Active 26-Man" has the larger slope, so it correlates more strongly with win% than "Payroll" does. We therefore use "Active 26-Man" for the rest of the models and graphs we create.
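The link between slope and correlation can be checked numerically. When x is in standard units, the least-squares slope equals r times the standard deviation of y. The sketch below verifies this on synthetic data; the values and names are hypothetical. One caveat for this notebook: the columns were standardized per year before 2020 and 2025 were dropped, so x is only approximately in standard units in the fitted sample.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (not MLB figures): a payroll-like x and a win%-like y.
rng = np.random.default_rng(1)
raw = rng.normal(100, 30, size=200)
win_pct = 0.5 + 0.001 * (raw - 100) + rng.normal(0, 0.05, size=200)

# Put x in standard units (sample standard deviation, ddof=1).
x_std = (raw - raw.mean()) / raw.std(ddof=1)

fit = LinearRegression().fit(x_std.reshape(-1, 1), win_pct)
slope_std = fit.coef_[0]
r_xy = np.corrcoef(x_std, win_pct)[0, 1]

# With unit-variance x, slope = r * sd(y); both printed values should match.
print(slope_std, r_xy * win_pct.std(ddof=1))
```

Since both models above share the same y, the factor sd(y) is identical, which is why comparing slopes is equivalent to comparing correlations.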

Creates a scatter plot of "Active 26-Man" and "win%", color coded by team. Then, adds the line of best fit for the data.

In [None]:

fig = px.scatter(
    df,
    x="Active 26-Man",
    y="win%",
    color="Team_x",
    title="Active 26-Man vs Win% by Team",
    labels={"Active 26-Man": "Active 26-Man", "win%": "Win%"}
)
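# Draw the fitted line using the stored intercept and slope from the linear model
fig.add_trace(go.Scatter(
    x=[-2, 3],
    y=[intercept - 2 * coef, intercept + 3 * coef],
    mode='lines',
    name='Trendline',
    line=dict(color='black', dash='dash')
))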


fig.show()

Note that the Baltimore Orioles, Tampa Bay Rays, Cleveland Guardians, and Milwaukee Brewers all have relatively low payrolls and perform much better than predicted. However, there are no teams with high payrolls that perform consistently and significantly worse than predicted. This indicates that it is possible to do well on a low budget, but the overall trend suggests these teams would likely do even better on a higher budget.

Create a logistic model for making the playoffs based on "Active 26-Man" payroll.

In [None]:
playoffs = sklearn.linear_model.LogisticRegression()
payroll = np.array(df["Active 26-Man"])
temp = np.array(df["Made Playoffs"])
playoffs.fit(payroll.reshape(-1, 1), temp)

Create a logistic model for winning the world series based on "Active 26-Man" payroll.

In [None]:
world = sklearn.linear_model.LogisticRegression()
payroll = np.array(df["Active 26-Man"])
won = np.array(df["Won World Series"])
world.fit(payroll.reshape(-1, 1), won)

Plots logistic model "world".

In [None]:
X_test = np.linspace(-10, 10, 300).reshape(-1, 1)
y_prob = world.predict_proba(X_test)[:, 1]
plt.plot(X_test, y_prob, label="Logistic Function")
plt.xlabel("Active 26-Man payroll (standard units)")
plt.ylabel("Probability of winning the World Series")
plt.title("Logistic Regression: Payroll vs World Series Win Probability")
plt.grid(True)
plt.legend()
plt.show()

Plots logistic model "playoffs".

In [None]:
X_test = np.linspace(-3, 3, 300).reshape(-1, 1)
y_prob = playoffs.predict_proba(X_test)[:, 1]
plt.plot(X_test, y_prob, label="Logistic Function")
plt.xlabel("Active 26-Man payroll (standard units)")
plt.ylabel("Probability of making the playoffs")
plt.title("Logistic Regression: Payroll vs Playoff Probability")
plt.grid(True)
plt.legend()
plt.show()
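Beyond plotting the curve, a logistic model's coefficient can be read as an odds ratio: `exp(coefficient)` is the factor by which the odds multiply for each one-unit increase in x (here, one standard deviation of payroll). The sketch below illustrates this on synthetic data; the data and names are hypothetical and the fitted values say nothing about the real models above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (not MLB figures): z plays the role of standardized payroll.
rng = np.random.default_rng(2)
z = rng.normal(0, 1, size=300)
p_true = 1 / (1 + np.exp(-(-0.8 + 0.9 * z)))
made = (rng.random(300) < p_true).astype(int)

# Note: scikit-learn applies L2 regularization by default (C=1.0).
clf = LogisticRegression().fit(z.reshape(-1, 1), made)
beta0, beta1 = clf.intercept_[0], clf.coef_[0][0]

odds_ratio = np.exp(beta1)                    # odds multiplier per +1 SD
p_avg = clf.predict_proba([[0.0]])[0, 1]      # league-average payroll
p_plus1 = clf.predict_proba([[1.0]])[0, 1]    # one SD above average

print(f"Odds ratio per +1 SD of payroll: {odds_ratio:.2f}")
print(f"P(success) at average payroll: {p_avg:.3f}, at +1 SD: {p_plus1:.3f}")
```

Reporting the predicted probability at a few payroll levels (for example -1, 0, and +1 standard deviations) is often easier to communicate than the raw coefficient.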

Create a linear model for making the playoffs based on "Active 26-Man" payroll. We decided to do this because a linear model is much easier to interpret than a logistic function. However, it is not as good for making predictions, as it may predict probabilities below 0 or above 1.

In [None]:
playoffs_linear = sklearn.linear_model.LinearRegression()
playoffs_linear.fit(payroll.reshape(-1, 1), temp)
print(f"Coefficient: {playoffs_linear.coef_[0]}")
print(f"Intercept: {playoffs_linear.intercept_}")
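The trade-off between the two model types can be seen by comparing their predictions at extreme payroll values. The sketch below fits both a linear and a logistic model to the same synthetic 0/1 outcome; the data and names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Synthetic 0/1 outcome (not MLB data) driven by a standardized predictor.
rng = np.random.default_rng(3)
z = rng.normal(0, 1, size=300)
outcome = (rng.random(300) < 1 / (1 + np.exp(-1.5 * z))).astype(int)
Z = z.reshape(-1, 1)

lpm = LinearRegression().fit(Z, outcome)      # linear probability model
logit = LogisticRegression().fit(Z, outcome)  # logistic model

# Predict far below, at, and far above the average.
grid = np.array([[-4.0], [0.0], [4.0]])
lpm_pred = lpm.predict(grid)
logit_pred = logit.predict_proba(grid)[:, 1]

print("Linear:  ", np.round(lpm_pred, 3))     # can leave the [0, 1] range
print("Logistic:", np.round(logit_pred, 3))   # always stays within [0, 1]
```

The linear coefficient reads directly as "change in probability per standard deviation of payroll", which is what makes it easy to interpret near the middle of the data, while the logistic model is the safer choice for predictions at the extremes.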

Create a linear model for winning the world series based on "Active 26-Man" payroll.

In [None]:
world_linear = sklearn.linear_model.LinearRegression()
world_linear.fit(payroll.reshape(-1, 1), won)
print(f"Coefficient: {world_linear.coef_[0]}")
print(f"Intercept: {world_linear.intercept_}")

Some more interpretations