# Predicting Umpire Accuracy: Data Preparation and Modeling

This notebook guides you through data loading, cleaning, and modeling steps to predict umpire accuracy using MLB data. We’ll use pandas, scikit-learn, matplotlib, and pybaseball.

## 1. Imports

First, we import all necessary libraries.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from pybaseball import schedule_and_record
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

## 2. Data Loading

We load three CSVs:
- Park mapping
- Umpire games
- Umpire stats

In [None]:
parks = pd.read_csv('datasets/Park_Mapping.csv')
parks_loc = parks[['Team Abv', 'Arena Location']]
games = pd.read_csv('datasets/umpgames.csv')
umpires = pd.read_csv('datasets/umpstats.csv')
time_record_by_game = pd.DataFrame()

## 3. Preprocessing: Creating Unique Game Identifiers

We combine the date and home team to create a unique game key for merging.

In [None]:
games['Date_team'] = games['Date'] + games['Home Team']
parks_list = parks['Pybaseball Abv'].tolist()
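
The merge in Section 6 depends on this key being built the same way on both sides, so it can help to check how often a key repeats before merging. The sketch below uses made-up rows and a hypothetical `make_game_key` helper; the column names follow the cell above, but the sample values are illustrative and not taken from `umpgames.csv`.

```python
import pandas as pd

def make_game_key(frame, date_col="Date", team_col="Home Team"):
    """Concatenate date and home team into a merge key (hypothetical helper)."""
    dates = frame[date_col].astype(str).str.strip()
    teams = frame[team_col].astype(str).str.strip()
    return dates + teams

# Illustrative rows only; real values come from umpgames.csv.
sample_games = pd.DataFrame({
    "Date": ["Thursday, Apr 1", "Thursday, Apr 1", "Saturday, Apr 3", "Saturday, Apr 3"],
    "Home Team": ["NYY", "BOS", "NYY", "NYY"],
})
sample_games["Date_team"] = make_game_key(sample_games)

# Rows that share a key (for example both games of a doubleheader) are flagged here.
repeated_keys = sample_games[sample_games["Date_team"].duplicated(keep=False)]
print(repeated_keys)
```

Repeated keys are not necessarily an error, but they do mean the key alone cannot tell two games on the same day apart.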

## 4. Gather Team Schedule and Record Data

We use pybaseball's `schedule_and_record` to fetch the 2021 schedule and game-by-game record for each team, then combine the results into `time_record_by_game`.

**Note:** This cell may take a while since it queries data for each team.

In [None]:
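# DataFrame.append was removed in pandas 2.0, so collect one frame per team
# and concatenate them in a single call.
season_frames = [schedule_and_record(2021, team) for team in parks_list]
time_record_by_game = pd.concat(season_frames, ignore_index=True)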
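
Because this step makes one request per team, a single failed request can stop the whole cell. One possible way to make it more tolerant is sketched below. `collect_schedules` and `fake_fetch` are hypothetical names introduced for illustration; in the notebook, the `fetch` argument would be pybaseball's `schedule_and_record`.

```python
import pandas as pd

def collect_schedules(team_codes, fetch, season=2021):
    """Fetch one frame per team and concatenate them (hypothetical helper).

    `fetch` is any callable taking (season, team).
    """
    frames, failed = [], []
    for team in team_codes:
        try:
            frames.append(fetch(season, team))
        except Exception as err:  # broad on purpose: network and parsing errors both land here
            failed.append((team, str(err)))
    combined = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return combined, failed

# Stand-in fetcher so the sketch runs offline; it is not the pybaseball API.
def fake_fetch(season, team):
    if team == "XXX":
        raise ValueError("unknown team")
    return pd.DataFrame({"Tm": [team, team],
                         "Date": ["Thursday, Apr 1", "Saturday, Apr 3"]})

schedules, failures = collect_schedules(["NYY", "XXX", "BOS"], fake_fetch)
print(len(schedules), failures)
```

Returning the failed teams alongside the combined frame makes it possible to retry only those teams rather than rerunning every request.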

## 5. Cleaning and Formatting Schedule Data

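We extract the doubleheader game number from the date string, strip that suffix from the date, keep only rows with a valid day/night flag, and convert 'Games Behind' (GB) to a signed number: positive when the team leads the standings, negative when it trails, and 0 when tied.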

In [None]:
time_record_by_game['Doubleheader Game'] = time_record_by_game['Date'].apply(
    lambda dh: dh[dh.find("(") + 1:dh.find(")")] if dh.find(')') > 0 else 0)
time_record_by_game['Date'] = time_record_by_game['Date'].map(
    lambda date: date[:-4] if date.find(')') > 0 else date)
time_record_by_game['Date_team'] = time_record_by_game['Date'] + time_record_by_game['Tm']
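# Keep only rows with a valid day/night flag; .copy() avoids chained-assignment warnings below.
time_record_by_game = time_record_by_game[time_record_by_game['D/N'].isin(['D', 'N'])].copy()
# 'Tied' becomes '0' and 'up 1.5' becomes '-1.5'; plain values such as '2.0' are left alone.
time_record_by_game['GB'] = time_record_by_game['GB'].replace('Tied', '0')
time_record_by_game['GB'] = time_record_by_game['GB'].str.replace('up ', '-', regex=False)
# Flip the sign so that leading is positive and trailing is negative.
time_record_by_game['GB'] = time_record_by_game['GB'].astype(float) * -1
time_record_by_game['GB'] = time_record_by_game['GB'].fillna(0)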
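
The string handling above is compact but easy to misread, so here is a minimal sketch of the same rules written as two small functions. The helper names are hypothetical, and the input formats (`"Saturday, Apr 3 (1)"`, `"up 1.5"`, `"Tied"`) are assumptions based on what the cleaning code expects, not values verified against the source data.

```python
import pandas as pd

def split_doubleheader(date_text):
    """Return (date without suffix, game number); 0 means not a doubleheader."""
    if date_text.endswith(")") and " (" in date_text:
        base, _, suffix = date_text.rpartition(" (")
        return base, int(suffix.rstrip(")"))
    return date_text, 0

def parse_games_behind(gb_text):
    """Signed standings gap: positive when leading, negative when trailing, 0 when tied."""
    if pd.isna(gb_text) or gb_text == "Tied":
        return 0.0
    text = str(gb_text)
    if text.startswith("up "):
        return float(text[3:])
    return -float(text)

print(split_doubleheader("Saturday, Apr 3 (1)"))
print([parse_games_behind(value) for value in ["Tied", "up 1.5", "2.0", None]])
```

Either function could be applied to a column with `Series.map`, which keeps the sign convention in one documented place.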

## 6. Merging All Data Sources

We join umpire game logs, umpire season stats, park locations, and team schedule/records into a single DataFrame.

In [None]:
umpire_games = pd.merge(games, umpires, how='inner', on='Umpire', suffixes=("", "_season"))
umpire_games_parks = pd.merge(umpire_games, parks_loc, how='inner', left_on='Home Team', right_on='Team Abv')
umpire_games_parks_time = pd.merge(umpire_games_parks, time_record_by_game, how='inner', on='Date_team')
umpire_games_parks_time = umpire_games_parks_time.drop_duplicates(subset=['ID'])
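
Inner joins silently drop rows whose keys do not match, which can hide a date-format mismatch between sources. A quick diagnostic, sketched here on made-up frames, is to run the same merge as a left join with `indicator=True` and inspect the unmatched rows. The frame contents and attendance figures are illustrative only.

```python
import pandas as pd

# Illustrative stand-ins for the umpire games and the schedule data.
left = pd.DataFrame({
    "ID": [1, 2, 3],
    "Date_team": ["Thursday, Apr 1NYY", "Thursday, Apr 1BOS", "Friday, Apr 2SEA"],
})
right = pd.DataFrame({
    "Date_team": ["Thursday, Apr 1NYY", "Thursday, Apr 1BOS"],
    "Attendance": [10850, 4452],
})

# indicator=True adds a `_merge` column saying where each row was found.
checked = pd.merge(left, right, how="left", on="Date_team", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["ID", "Date_team"]])
```

If many rows come back as `left_only`, the key columns probably differ in formatting, and the inner join would be discarding those games.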

## 7. Selecting Usable Columns for Analysis

We select relevant columns for downstream analysis and modeling.

In [None]:
usable_cols = [
    'Date_x', 'Umpire', 'Home Team', 'Away Team', 'Home Score', 'Away Score', 'Accuracy',
    'Consistency', 'Favor [Home]', 'Games', 'Called Pitches_season', 'Accuracy_season',
    'Arena Location', 'W/L', 'R', 'Inn', 'GB', 'Time', 'D/N', 'Attendance', 'Streak',
    'Doubleheader Game'
]
usable_data = umpire_games_parks_time[usable_cols]

## 8. Modeling Data Preparation

We build the feature matrix (X) from the numeric columns and fill missing values with 0. The target (y) is defined in Section 10.

In [None]:
model_cols = [
    'Games', 'Called Pitches_season', 'Accuracy_season',
    'Inn', 'GB', 'Attendance', 'Streak', 'Doubleheader Game'
]
model_data = umpire_games_parks_time[model_cols].fillna(0)
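
Filling with 0 treats a missing value the same as a true zero, and it does not convert columns that hold numbers as strings. A hedged sketch of one way to check both before modeling is shown below; the sample frame is invented for illustration and only borrows column names from the list above.

```python
import pandas as pd

# Illustrative values: a missing attendance, a string game number, a missing GB.
raw = pd.DataFrame({
    "Attendance": [10850, None, 4452],
    "Doubleheader Game": ["1", 0, "2"],
    "GB": [1.0, -2.5, None],
})

# Coerce every column to a numeric dtype; unparseable entries become NaN.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Share of missing values per column, recorded before they are filled.
missing_share = numeric.isna().mean()
print(missing_share)

features = numeric.fillna(0)
print(features.dtypes)
```

Looking at `missing_share` first shows how much of each feature is imputed, which is worth knowing when interpreting the model.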

## 9. Exploratory Function: Plotting

A helper function that draws a scatter plot of two columns from a DataFrame.

In [None]:
def plot_scatter(x_axis, y_axis, input_data):
    plt.scatter(input_data[x_axis], input_data[y_axis])
    plt.xlabel(x_axis)
    plt.ylabel(y_axis)
    plt.show()

## 10. Modeling: Random Forest Regressor

We train and evaluate a Random Forest model to predict umpire accuracy.

In [None]:
x = model_data
y = usable_data['Accuracy']
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state=1)

umpire_performance_model = RandomForestRegressor(random_state=1)
umpire_performance_model.fit(train_x, train_y)
umpire_predictions = umpire_performance_model.predict(val_x)

print("Predicted Accuracies:", umpire_predictions)
print("Mean Absolute Error:", mean_absolute_error(val_y, umpire_predictions))
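
A mean absolute error is easier to interpret next to a naive baseline. The sketch below compares a random forest with scikit-learn's `DummyRegressor`, which always predicts the training mean. It runs on synthetic data, so the numbers say nothing about the umpire dataset, and the `demo_` names are introduced here to avoid clashing with the variables above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: the target depends on the first feature plus noise.
rng = np.random.default_rng(1)
demo_x = rng.normal(size=(200, 3))
demo_y = 93.0 + demo_x[:, 0] + rng.normal(scale=0.5, size=200)

demo_train_x, demo_val_x, demo_train_y, demo_val_y = train_test_split(
    demo_x, demo_y, random_state=1)

# Baseline that ignores the features and predicts the training mean.
baseline = DummyRegressor(strategy="mean").fit(demo_train_x, demo_train_y)
forest = RandomForestRegressor(random_state=1).fit(demo_train_x, demo_train_y)

baseline_mae = mean_absolute_error(demo_val_y, baseline.predict(demo_val_x))
forest_mae = mean_absolute_error(demo_val_y, forest.predict(demo_val_x))
print(f"Baseline MAE: {baseline_mae:.3f}  Random forest MAE: {forest_mae:.3f}")
```

If the model's error on the real data is not clearly below the baseline's, the features are adding little beyond the average accuracy.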

## Notes
- Ensure all required data files are in a `datasets/` folder relative to your notebook.
- pybaseball may need to be installed via `pip install pybaseball`.
- Depending on your data, some columns may need additional cleaning.