<a href="https://colab.research.google.com/github/RajSingh23/Projects/blob/main/HitPredictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import warnings
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

warnings.filterwarnings('ignore')

In [None]:
# Load the dataset
data = pd.read_csv('savant_data.csv')

# Remove samples with missing values
data = data.dropna(subset=['launch_speed', 'launch_angle', 'events'])

# Select relevant columns (copy so the new column is not written to a slice)
data = data[['launch_speed', 'launch_angle', 'events']].copy()

# Create 'is_hit' column based on hit events
hit_events = ['single', 'double', 'triple', 'home_run']
data['is_hit'] = data['events'].isin(hit_events).astype(int)

# Select features and target
features = ['launch_speed', 'launch_angle']
X = data[features]
y = data['is_hit']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier()
}

# Train and evaluate each model
for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{model_name} Accuracy: {accuracy}')

    # Evaluate ROC-AUC for models that support probability prediction
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test, y_prob)
        print(f'{model_name} ROC-AUC: {roc_auc}')


Training Logistic Regression...
Logistic Regression Accuracy: 0.70028067361668
Logistic Regression ROC-AUC: 0.684333400974026

Training Decision Tree...
Decision Tree Accuracy: 0.7530072173215717
Decision Tree ROC-AUC: 0.7285953180342384

Training Random Forest...
Random Forest Accuracy: 0.7784683239775461
Random Forest ROC-AUC: 0.8280048147874852

Training Gradient Boosting...
Gradient Boosting Accuracy: 0.8067361668003208
Gradient Boosting ROC-AUC: 0.86132314418536


From the moment the ball is hit, gradient boosting can predict fairly well whether it will become a hit using only exit velocity and launch angle: about 0.81 accuracy and 0.86 ROC-AUC on the test set. Next, we want to see whether a derived statistic, "launch_speed_angle", gives a similar or more accurate prediction. This statistic combines exit velocity and launch angle to classify a batted ball into one of six categories ranging from "weak" to "barrel".
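As a small illustration of what the six categories look like in practice, the sketch below maps the numeric "launch_speed_angle" codes to readable labels. The label names are recalled from the Baseball Savant CSV documentation and should be treated as an assumption; the rows are made up and do not come from savant_data.csv.

```python
import pandas as pd

# Category names recalled from the Baseball Savant CSV documentation;
# treat the exact wording as an assumption and check it against the docs.
CONTACT_LABELS = {
    1: "Weak",
    2: "Topped",
    3: "Under",
    4: "Flare/Burner",
    5: "Solid Contact",
    6: "Barrel",
}

# Made-up rows for illustration only (not taken from savant_data.csv).
toy = pd.DataFrame({"launch_speed_angle": [1.0, 6.0, 4.0, 2.0, 6.0, 3.0, 5.0]})

# Map the numeric code to a readable label and count each contact type.
toy["contact_type"] = toy["launch_speed_angle"].astype(int).map(CONTACT_LABELS)
contact_counts = toy["contact_type"].value_counts()
print(contact_counts)
```

The models in the next cell work on the numeric codes directly; the labels are only for reading the data.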

In [None]:
# Load the dataset
data = pd.read_csv('savant_data.csv')

# Remove samples with missing values
data = data.dropna(subset=['launch_speed_angle', 'events'])

# Select relevant columns including 'launch_speed_angle' (copy to avoid writing to a slice)
data = data[['launch_speed_angle', 'events']].copy()

# Create 'is_hit' column based on hit events
hit_events = ['single', 'double', 'triple', 'home_run']
data['is_hit'] = data['events'].isin(hit_events).astype(int)

# Select features and target
features = ['launch_speed_angle']
X = data[features]
y = data['is_hit']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier()
}

# Train and evaluate each model
for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{model_name} Accuracy: {accuracy}')

    # Evaluate ROC-AUC for models that support probability prediction
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test, y_prob)
        print(f'{model_name} ROC-AUC: {roc_auc}')


Training Logistic Regression...
Logistic Regression Accuracy: 0.7089013632718525
Logistic Regression ROC-AUC: 0.7446428571428572

Training Decision Tree...
Decision Tree Accuracy: 0.7921010425020049
Decision Tree ROC-AUC: 0.8121264389020071

Training Random Forest...
Random Forest Accuracy: 0.7921010425020049
Random Forest ROC-AUC: 0.8121264389020071

Training Gradient Boosting...
Gradient Boosting Accuracy: 0.7921010425020049
Gradient Boosting ROC-AUC: 0.8121264389020071


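Interestingly, the simpler models benefited from this attribute: logistic regression's ROC-AUC rose from about 0.68 to 0.74, and the single decision tree's from about 0.73 to 0.81. Gradient boosting got worse on both metrics (accuracy 0.81 to 0.79, ROC-AUC 0.86 to 0.81), while random forest gained a little accuracy but lost ROC-AUC (0.83 to 0.81). A likely explanation is that an unpruned decision tree overfits the two continuous features, whereas logistic regression is limited to a linear decision boundary and tends to underfit; a pre-binned category helps both. The three tree-based models score identically here because, with a single feature that takes only six values, each of them can do little more than learn one hit rate per category. With the ensemble methods we get a somewhat better prediction from the raw launch angle and exit velocity, which suggests they can use the finer detail in the continuous features without overfitting as badly as a single tree.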
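The identical tree-based scores can be illustrated with a minimal sketch: when the only feature takes six values, a fully grown decision tree ends up predicting the training hit rate of each category. The data below is synthetic, and the per-category hit rates in `assumed_hit_rate` are invented for the demonstration rather than measured from savant_data.csv.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Invented hit rates per category, used only to generate synthetic labels.
assumed_hit_rate = {1: 0.10, 2: 0.20, 3: 0.15, 4: 0.60, 5: 0.45, 6: 0.80}
category = rng.integers(1, 7, size=6000)
hit_chance = pd.Series(category).map(assumed_hit_rate).to_numpy()
toy = pd.DataFrame({
    "launch_speed_angle": category,
    "is_hit": (rng.random(6000) < hit_chance).astype(int),
})

# Fit an unconstrained tree on the single six-valued feature.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(toy[["launch_speed_angle"]], toy["is_hit"])

# Predicted hit probability for each category vs. the observed training rate.
categories = pd.DataFrame({"launch_speed_angle": np.arange(1, 7)})
tree_prob = tree.predict_proba(categories)[:, 1]
empirical_rate = toy.groupby("launch_speed_angle")["is_hit"].mean().to_numpy()

print(pd.DataFrame({"tree_prob": tree_prob, "empirical_rate": empirical_rate},
                   index=categories["launch_speed_angle"]))
```

Since ROC-AUC depends only on how samples are ranked, any model that orders the six categories the same way will report the same ROC-AUC, which is consistent with the matching numbers above.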

In [None]:
#################################################################################################################################

It makes sense that we can make a strong prediction about whether a hit occurs based on how the ball is hit, but can we make a strong prediction before the ball is even hit? Next, we're going to study the count and see if we can use feature engineering to improve our models.

First, we are going to train our models based on ball and strike data to get a benchmark of how well count allows us to predict whether a hit occurs on a batted ball.

In [None]:
#Load the dataset
data = pd.read_csv('savant_data.csv')

# Remove samples with missing values
data = data.dropna(subset=['balls', 'strikes', 'events'])

# Select relevant columns (copy so the new column is not written to a slice)
data = data[['balls', 'strikes', 'events']].copy()

# Create 'is_hit' column based on hit events
hit_events = ['single', 'double', 'triple', 'home_run']
data['is_hit'] = data['events'].isin(hit_events).astype(int)

# Select features and target
features = ['balls', "strikes"]
X = data[features]
y = data['is_hit']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier()
}

# Train and evaluate each model
for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{model_name} Accuracy: {accuracy}')

    # Evaluate ROC-AUC
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test, y_prob)
        print(f'{model_name} ROC-AUC: {roc_auc}')


Training Logistic Regression...
Logistic Regression Accuracy: 0.673
Logistic Regression ROC-AUC: 0.5158811474478691

Training Decision Tree...
Decision Tree Accuracy: 0.673
Decision Tree ROC-AUC: 0.5032539498616355

Training Random Forest...
Random Forest Accuracy: 0.673
Random Forest ROC-AUC: 0.501985995428748

Training Gradient Boosting...
Gradient Boosting Accuracy: 0.673
Gradient Boosting ROC-AUC: 0.5032539498616355


Now I want to do some feature engineering to see whether we can create a feature that improves model performance. The idea is to group similar counts into "bins" and see whether that categorization helps the models. There are twelve possible counts in baseball; first, we will try three groups of four counts based on batting average. Based on data from the 2023 season, the groups are as follows: [0-2, 1-2, 2-2, 3-2] as the worst, [2-0, 0-1, 1-1, 2-1] in the middle, and [0-0, 1-0, 3-0, 3-1] as the best. The code and results are below:

In [None]:
# Load the dataset
data = pd.read_csv('savant_data.csv')

# Remove samples with missing values
data = data.dropna(subset=['balls', 'strikes', 'events'])

# Select relevant columns (copy so the new columns are not written to a slice)
data = data[['balls', 'strikes', 'events']].copy()

# Create 'is_hit' column based on hit events
hit_events = ['single', 'double', 'triple', 'home_run']
data['is_hit'] = data['events'].isin(hit_events).astype(int)

# Define the count groups based on the 2023 season batting averages
worst_counts = [(0, 2), (1, 2), (2, 2), (3, 2)]
middle_counts = [(2, 0), (0, 1), (1, 1), (2, 1)]
best_counts = [(0, 0), (1, 0), (3, 0), (3, 1)]

def assign_count_group(row):
    count = (row['balls'], row['strikes'])
    if count in worst_counts:
        return 'worst'
    elif count in middle_counts:
        return 'middle'
    elif count in best_counts:
        return 'best'
    else:
        return 'other'

# Create the 'count_group' feature
data['count_group'] = data.apply(assign_count_group, axis=1)

# Convert 'count_group' to categorical codes
data['count_group'] = data['count_group'].astype('category').cat.codes

# Select features and target
features = ['count_group']
X = data[features]
y = data['is_hit']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier()
}

# Train and evaluate each model
for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{model_name} Accuracy: {accuracy}')

    # Evaluate ROC-AUC for models that support probability prediction
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test, y_prob)
        print(f'{model_name} ROC-AUC: {roc_auc}')



Training Logistic Regression...
Logistic Regression Accuracy: 0.673
Logistic Regression ROC-AUC: 0.517212263315021

Training Decision Tree...
Decision Tree Accuracy: 0.673
Decision Tree ROC-AUC: 0.517212263315021

Training Random Forest...
Random Forest Accuracy: 0.673
Random Forest ROC-AUC: 0.517212263315021

Training Gradient Boosting...
Gradient Boosting Accuracy: 0.673
Gradient Boosting ROC-AUC: 0.517212263315021


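Using these bins failed to improve model performance in a meaningful way: accuracy stays at 0.673 for every model, and ROC-AUC is about 0.517, barely above the 0.5 of a random guess. Identical accuracy across all four models, with and without the bins, suggests that each model simply predicts the majority class for every sample. Knowing just the balls and strikes is not enough to beat what an average baseball fan could predict. The bins are also drawn the way humans already think about the count, so they give the models no information beyond what the raw balls and strikes already carry.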
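One way to check whether an accuracy like 0.673 reflects anything more than the base rate is to compare it against a majority-class baseline. The sketch below uses scikit-learn's `DummyClassifier` on synthetic data; the 33% hit rate and the sample size are assumptions chosen for illustration, and the labels are deliberately unrelated to the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5000

# Synthetic stand-ins for 'balls' (0-3) and 'strikes' (0-2).
X = np.column_stack([rng.integers(0, 4, n), rng.integers(0, 3, n)])
# Assumed base hit rate of 33%, generated independently of X.
y = (rng.random(n) < 0.33).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Always predict the most frequent training label ("no hit" here).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
baseline_accuracy = accuracy_score(y_test, baseline_pred)

# Share of the majority class in the test set, for comparison.
majority_share = max(y_test.mean(), 1 - y_test.mean())
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Majority share:    {majority_share:.3f}")
```

If a trained model's accuracy matches this baseline on the real data, its accuracy carries no signal from the features, and ROC-AUC is the more informative number.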

In [None]:
#####################################################################################################################################

Let's go back to trying to get the best prediction we can. Here we'll add a few features that are known before contact (the zone of the pitch, the pitcher, and the batter) to launch angle and exit velocity to see whether they help.

In [None]:
#Load the dataset
data = pd.read_csv('savant_data.csv')

# Remove samples with missing values
data = data.dropna(subset=['launch_angle', 'launch_speed', 'zone', 'pitcher', 'batter', 'events'])

# Select relevant columns (copy so the new column is not written to a slice)
data = data[['launch_angle', 'launch_speed', 'zone', 'pitcher', 'batter', 'events']].copy()

# Create 'is_hit' column based on hit events
hit_events = ['single', 'double', 'triple', 'home_run']
data['is_hit'] = data['events'].isin(hit_events).astype(int)

# Select features and target
features = ['launch_angle', 'launch_speed', 'zone', 'pitcher', 'batter']
X = data[features]
y = data['is_hit']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier()
}

# Train and evaluate each model
for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{model_name} Accuracy: {accuracy}')

    # Evaluate ROC-AUC
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test, y_prob)
        print(f'{model_name} ROC-AUC: {roc_auc}')


Training Logistic Regression...
Logistic Regression Accuracy: 0.6792301523656776
Logistic Regression ROC-AUC: 0.5022454250295159

Training Decision Tree...
Decision Tree Accuracy: 0.7363672814755413
Decision Tree ROC-AUC: 0.697580061983471

Training Random Forest...
Random Forest Accuracy: 0.7999198075380914
Random Forest ROC-AUC: 0.8589424070247934

Training Gradient Boosting...
Gradient Boosting Accuracy: 0.8061347233360064
Gradient Boosting ROC-AUC: 0.858036175472255


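When we add the batter, the pitcher, and the part of the zone the pitch was in, random forest improves noticeably (ROC-AUC from about 0.83 to 0.86), but gradient boosting does not improve, and logistic regression drops to a ROC-AUC of about 0.50. One possible cause of that drop, not verified here, is that "pitcher" and "batter" are player ID numbers that the models treat as ordinary numeric values. Pre-contact data is not going to be able to consistently predict whether a hit occurs because there is too much variance in the outcome. Now we will try to improve model performance with additional data from after contact, such as the location of the ball after it is put into play.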
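Logistic regression's ROC-AUC of about 0.50 in the output above may be related to feeding raw ID columns in as numbers. A minimal sketch of one alternative is shown below: scale the continuous features and one-hot encode "zone", "pitcher", and "batter" inside a scikit-learn pipeline. The data is synthetic, the ID values are invented, and whether this helps on savant_data.csv has not been tested here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic rows with the same column names as the cell above; values are made up.
toy = pd.DataFrame({
    "launch_angle": rng.normal(12, 25, n),
    "launch_speed": rng.normal(88, 12, n),
    "zone": rng.choice([1, 2, 3, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14], n),
    "pitcher": rng.choice([600001, 600002, 600003], n),
    "batter": rng.choice([500001, 500002, 500003, 500004], n),
})
toy["is_hit"] = rng.integers(0, 2, n)  # random labels, illustration only

# Scale continuous columns; treat IDs and zone as categories, not magnitudes.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["launch_angle", "launch_speed"]),
    ("ids", OneHotEncoder(handle_unknown="ignore"), ["zone", "pitcher", "batter"]),
])
model = make_pipeline(preprocess, LogisticRegression(max_iter=1000))

X_toy = toy.drop(columns="is_hit")
model.fit(X_toy, toy["is_hit"])
hit_prob = model.predict_proba(X_toy)[:, 1]
print(hit_prob[:5])
```

With many distinct pitchers and batters the one-hot matrix becomes wide and sparse, so this is a starting point rather than a tuned design.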

In [7]:
#Load the dataset
data = pd.read_csv('savant_data.csv')

# Remove samples with missing values
data = data.dropna(subset=['launch_angle', 'launch_speed', 'hc_x', 'hc_y', 'events'])

# Select relevant columns (copy so the new column is not written to a slice)
data = data[['launch_angle', 'launch_speed', 'hc_x', 'hc_y', 'events']].copy()

# Create 'is_hit' column based on hit events
hit_events = ['single', 'double', 'triple', 'home_run']
data['is_hit'] = data['events'].isin(hit_events).astype(int)

# Select features and target
features = ['launch_angle', 'launch_speed', 'hc_x', 'hc_y']
X = data[features]
y = data['is_hit']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize models
models = {
    "Logistic Regression": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier()
}

# Train and evaluate each model
for model_name, model in models.items():
    print(f"\nTraining {model_name}...")
    model.fit(X_train, y_train)

    # Predict on the test set
    y_pred = model.predict(X_test)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f'{model_name} Accuracy: {accuracy}')

    # Evaluate ROC-AUC
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test, y_prob)
        print(f'{model_name} ROC-AUC: {roc_auc}')


Training Logistic Regression...
Logistic Regression Accuracy: 0.7394182547642929
Logistic Regression ROC-AUC: 0.7847633770416379

Training Decision Tree...
Decision Tree Accuracy: 0.8651955867602809
Decision Tree ROC-AUC: 0.8483105590062111

Training Random Forest...
Random Forest Accuracy: 0.9165496489468405
Random Forest ROC-AUC: 0.9563670577409707

Training Gradient Boosting...
Gradient Boosting Accuracy: 0.9009027081243731
Gradient Boosting ROC-AUC: 0.9448371750632619


The "hc_x" and "hc_y" attributes record where the batted ball ended up on the field, so every model improved, with random forest reaching about 0.92 accuracy and 0.96 ROC-AUC. These coordinates are only known once the play is over, so this result is closer to describing the outcome than to predicting it at contact. If we can factor in additional variables such as wind or day vs. night, we could calculate trajectories to predict where a ball will land, which would let an analyst tell a hitting coach what kind of contact gives batters the best chance of success on a particular day.
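As a rough starting point for the trajectory idea, the sketch below computes the carry distance of a batted ball from exit velocity and launch angle using the textbook drag-free projectile formula, range = v² · sin(2θ) / g. This is an idealized assumption, not a model from this notebook: it ignores air resistance, spin, wind, and contact height, so it overstates real distances considerably. The function name `drag_free_range_ft` is hypothetical.

```python
import math

MPH_TO_MPS = 0.44704      # miles per hour to meters per second
METERS_TO_FEET = 3.28084  # meters to feet
G = 9.80665               # standard gravity in m/s^2


def drag_free_range_ft(launch_speed_mph, launch_angle_deg):
    """Flat-ground carry in feet with no drag, wind, spin, or contact height.

    Only meaningful for launch angles between 0 and 90 degrees.
    """
    speed = launch_speed_mph * MPH_TO_MPS
    angle = math.radians(launch_angle_deg)
    return speed ** 2 * math.sin(2 * angle) / G * METERS_TO_FEET


# Compare two launch angles at the same exit velocity.
for angle_deg in (15, 30):
    distance = drag_free_range_ft(100, angle_deg)
    print(f"100 mph at {angle_deg} degrees: {distance:.0f} ft (drag-free)")
```

A usable landing-spot model would need drag and weather terms fitted to tracked data; the value of this sketch is only in showing how launch_speed and launch_angle enter the calculation.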