In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
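import os

# Build the path from the mount point used above so it does not depend on the working directory.
GOOGLE_DRIVE_PATH = os.path.join('/content/drive', 'My Drive', '451_Project', 'Proposal')
print(os.listdir(GOOGLE_DRIVE_PATH))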

['MLB_451_project_proposal.ipynb']


In [21]:
from google.colab import files
import pandas as pd

# files.upload() saves the selected file to the working directory; it must be named stats451.csv.
uploaded = files.upload()
df = pd.read_csv('stats451.csv')

This project concerns the "MLB Statcast" dataset of all batters in the MLB over the years 2020-2024. The data is made publicly available by Baseball Savant. To begin the analysis, we have imported the dataset into a dataframe, 'df'.

This dataset consists of 733 observations across 18 attributes. The 18 attributes considered are:
*   last_name, first_name
*   player_id
*   year: year the stats were recorded
*   pa: number of plate appearances
*   k_percent: strikeout rate
*   bb_percent: walk (base on balls) rate
*   woba: weighted on-base average, an on-base measure that weights each way of reaching base by its value
*   xwoba: expected weighted on-base average
*   exit_velocity_avg: how hard, on average, a batter hits the ball
*   launch_angle_avg: average vertical angle at which the ball leaves the bat; higher values indicate more balls hit in the air
*   sweet_spot_percent: how often a player produces a batted-ball event in the launch angle sweet-spot zone of 8-32 degrees
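*   barrel_batted_rate: rate at which a player produces barrels, i.e. batted balls with an exit velocity of at least 98 mph and a launch angle in the matching range (26-30 degrees at 98 mph, widening as exit velocity increases)
*   hard_hit_percent: percentage of batted balls hit at 95 mph or more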
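*   avg_best_speed
*   avg_hyper_speed
*   whiff_percent: percentage of swings that miss
*   swing_percent: percentage of pitches at which the batter swings
*   flyballs_percent: percentage of batted balls that are fly balls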
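
As a quick sanity check on the attribute list above, the column names can be compared against the expected set. The following is a minimal sketch, not part of the original analysis: `SAMPLE_CSV` reproduces the first three rows of the `df.head` output in this notebook so that the block runs on its own, and `check_columns` and `EXPECTED_COLUMNS` are hypothetical names introduced here. It also assumes the leading unnamed column is a saved row index.

```python
import io
import pandas as pd

# Stand-in for stats451.csv: the first three rows shown by df.head(n=10) in this notebook.
SAMPLE_CSV = ''',"last_name, first_name",player_id,year,pa,k_percent,bb_percent,woba,xwoba,exit_velocity_avg,launch_angle_avg,sweet_spot_percent,barrel_batted_rate,hard_hit_percent,avg_best_speed,avg_hyper_speed,whiff_percent,swing_percent,flyballs_percent
0,"Cabrera, Miguel",408234,2020,231,22.1,10.4,0.323,0.379,93.2,12.1,36.8,9.7,49.7,102.655113,96.026886,31.6,47.7,21.9
1,"Cruz Jr., Nelson",443558,2020,214,27.1,11.7,0.411,0.383,91.6,9.4,39.4,15.0,47.2,102.72368,95.933078,34.2,47.6,21.3
2,"Peralta, David",444482,2020,218,20.6,6.0,0.333,0.299,89.2,6.4,29.4,5.0,36.3,100.556637,94.354591,21.1,46.6,18.8
'''

# The 18 attribute names, taken from the header of the df.head output.
EXPECTED_COLUMNS = [
    "last_name, first_name", "player_id", "year", "pa", "k_percent",
    "bb_percent", "woba", "xwoba", "exit_velocity_avg", "launch_angle_avg",
    "sweet_spot_percent", "barrel_batted_rate", "hard_hit_percent",
    "avg_best_speed", "avg_hyper_speed", "whiff_percent", "swing_percent",
    "flyballs_percent",
]


def check_columns(frame, expected):
    """Return (missing, unexpected) column names relative to `expected`."""
    missing = [c for c in expected if c not in frame.columns]
    unexpected = [c for c in frame.columns if c not in expected]
    return missing, unexpected


# Assumption: the first, unnamed CSV column is a row index, so it is read as the index.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col=0)
missing, unexpected = check_columns(sample, EXPECTED_COLUMNS)
print(sample.shape)
print("missing:", missing, "unexpected:", unexpected)
```

In the notebook itself, `check_columns(df, EXPECTED_COLUMNS)` would run the same check on the full dataset, and `df.shape` would confirm the number of observations.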


The aim of this project is to accurately predict Expected Weighted On-base Average (xWOBA) using the 17 other attributes as predictors. xWOBA is considered to give a comprehensive overview of a player's offensive value. By analyzing the relationships between xWOBA and these 17 attributes, we can gain insight into the most influential aspects of a hitter's performance.
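
One simple way to start examining these relationships is to rank the numeric attributes by their Pearson correlation with xWOBA. The sketch below is illustrative only: `sample` holds a few columns from the ten rows displayed by `df.head(n=10)` in this notebook, `rank_correlations` is a hypothetical helper, and dropping `player_id` is an assumption made because it is an identifier rather than a performance measure. Ten rows are far too few to draw conclusions from.

```python
import pandas as pd

# Stand-in data: selected columns from the ten rows shown by df.head(n=10).
sample = pd.DataFrame({
    "player_id": [408234, 443558, 444482, 446334, 452678,
                  453568, 456781, 457705, 458015, 467793],
    "k_percent": [22.1, 27.1, 20.6, 18.7, 18.8, 17.8, 19.2, 19.9, 19.3, 16.9],
    "bb_percent": [10.4, 11.7, 6.0, 5.3, 8.9, 7.7, 4.9, 9.1, 16.6, 18.4],
    "exit_velocity_avg": [93.2, 91.6, 89.2, 91.7, 89.5, 86.9, 88.5, 89.7, 87.4, 88.0],
    "barrel_batted_rate": [9.7, 15.0, 5.0, 11.5, 6.5, 4.9, 4.6, 8.2, 9.1, 6.7],
    "xwoba": [0.379, 0.383, 0.299, 0.364, 0.317, 0.331, 0.32, 0.358, 0.369, 0.372],
})


def rank_correlations(frame, target="xwoba", drop=("player_id",)):
    """Pearson correlation of each numeric column with `target`, strongest first."""
    numeric = frame.select_dtypes(include="number")
    numeric = numeric.drop(columns=list(drop), errors="ignore")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute correlation so strong negative relationships rank high too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


ranked = rank_correlations(sample)
print(ranked)
```

Calling `rank_correlations(df)` on the full dataframe would give the same ranking for all numeric attributes. Correlation only captures linear, pairwise association, so it is a starting point for the analysis rather than a measure of predictive importance.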

$\textbf{Question of interest:}$

Which features (of those considered in this analysis) are the most predictive of, and have the most significant effect on, a player's offensive value, as measured by Expected Weighted On-base Average (xWOBA)?

$\textbf{Methodologies:}$

A high-level overview of the methodologies we expect to employ in this project:

*   $\textbf{Data Collection}$: Compile a dataset of baseball hitters from the data made publicly available by MLB Statcast for the years 2020-2024, ensuring representation across a diverse range of players.

*   $\textbf{Data Preprocessing}$: Handle missing values and outliers, and normalize features where necessary. Additionally, confirm that each included feature is relevant to the task at hand.

*   $\textbf{Exploratory Data Analysis (EDA)}$: Visualize the dataset and subsets of it to identify correlations and underlying relationships within and between attributes. Additionally, analyze the distribution of each feature and of the target variable, xWOBA.

*   $\textbf{Feature Engineering}$: Based on EDA findings, engineer new features or transform existing ones.

*   $\textbf{Model Selection}$: Evaluate several regression models, such as linear regression, SVM, decision trees, and kNN, to determine which is most appropriate for the dataset at hand.

*   $\textbf{Model Evaluation}$: Use cross-validation to assess model performance, focusing on metrics relevant to regression analysis such as RMSE (Root Mean Squared Error). Additionally, hold out separate validation and test datasets.

*   $\textbf{Feature Importance Analysis}$: Use model insights to identify the most influential features in predicting xWOBA, providing valuable insights into the evaluation of hitting performance.
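
The Model Selection and Model Evaluation steps above could be sketched with scikit-learn as follows. This is a rough illustration under stated assumptions, not the final pipeline: scikit-learn is assumed to be the modeling library, `compare_models` is a hypothetical helper, the hyperparameters (such as `n_neighbors=3`) are placeholders, and `sample` contains only four predictors from the ten rows shown by `df.head(n=10)`, so the resulting RMSE values are not meaningful.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Stand-in data: selected columns from the ten rows shown by df.head(n=10).
sample = pd.DataFrame({
    "k_percent": [22.1, 27.1, 20.6, 18.7, 18.8, 17.8, 19.2, 19.9, 19.3, 16.9],
    "bb_percent": [10.4, 11.7, 6.0, 5.3, 8.9, 7.7, 4.9, 9.1, 16.6, 18.4],
    "exit_velocity_avg": [93.2, 91.6, 89.2, 91.7, 89.5, 86.9, 88.5, 89.7, 87.4, 88.0],
    "barrel_batted_rate": [9.7, 15.0, 5.0, 11.5, 6.5, 4.9, 4.6, 8.2, 9.1, 6.7],
    "xwoba": [0.379, 0.383, 0.299, 0.364, 0.317, 0.331, 0.32, 0.358, 0.369, 0.372],
})


def compare_models(X, y, n_splits=5, seed=0):
    """Return the mean cross-validated RMSE for each candidate model."""
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # Scale-sensitive models are wrapped in a pipeline so that scaling is
    # fit on the training folds only.
    models = {
        "linear_regression": make_pipeline(StandardScaler(), LinearRegression()),
        "svm": make_pipeline(StandardScaler(), SVR()),
        "decision_tree": DecisionTreeRegressor(random_state=seed),
        "knn": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=3)),
    }
    results = {}
    for name, model in models.items():
        # scikit-learn reports negated RMSE so that larger is better; flip the sign.
        scores = cross_val_score(
            model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        results[name] = float(-scores.mean())
    return results


X = sample.drop(columns=["xwoba"])
y = sample["xwoba"]
rmse_by_model = compare_models(X, y)
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.4f}")
```

On the full dataset, `X` would be built from the numeric predictor columns of `df`, and a held-out test set would be set aside before cross-validation so that the final model is evaluated on data it has not seen.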


$\textbf{Expected Outcomes:}$

*   A model that accurately predicts xWOBA and identifies the features with the greatest influence on a player's offensive value, as quantified by xWOBA.
*   Strategic insights for players and coaches on improving offensive performance, based on the model's findings.





To give a high-level overview of the data, the first 10 rows of the dataset are displayed below.






In [36]:
df.head(n=10)

Unnamed: 0,"last_name, first_name",player_id,year,pa,k_percent,bb_percent,woba,xwoba,exit_velocity_avg,launch_angle_avg,sweet_spot_percent,barrel_batted_rate,hard_hit_percent,avg_best_speed,avg_hyper_speed,whiff_percent,swing_percent,flyballs_percent
0,"Cabrera, Miguel",408234,2020,231,22.1,10.4,0.323,0.379,93.2,12.1,36.8,9.7,49.7,102.655113,96.026886,31.6,47.7,21.9
1,"Cruz Jr., Nelson",443558,2020,214,27.1,11.7,0.411,0.383,91.6,9.4,39.4,15.0,47.2,102.72368,95.933078,34.2,47.6,21.3
2,"Peralta, David",444482,2020,218,20.6,6.0,0.333,0.299,89.2,6.4,29.4,5.0,36.3,100.556637,94.354591,21.1,46.6,18.8
3,"Longoria, Evan",446334,2020,209,18.7,5.3,0.308,0.364,91.7,10.7,29.9,11.5,45.2,101.53026,95.520896,21.0,45.0,25.5
4,"Cabrera, Asdrúbal",452678,2020,213,18.8,8.9,0.319,0.317,89.5,13.7,30.5,6.5,38.3,97.982869,93.323023,20.5,46.1,26.6
5,"Blackmon, Charlie",453568,2020,247,17.8,7.7,0.34,0.331,86.9,13.5,38.5,4.9,29.7,97.303074,92.719757,23.3,50.2,24.7
6,"Solano, Donovan",456781,2020,203,19.2,4.9,0.357,0.32,88.5,15.5,43.4,4.6,34.9,98.263816,93.264045,22.0,51.9,22.4
7,"McCutchen, Andrew",457705,2020,241,19.9,9.1,0.327,0.358,89.7,18.2,36.5,8.2,41.2,99.227581,93.968213,22.0,40.6,27.1
8,"Votto, Joey",458015,2020,223,19.3,16.6,0.347,0.369,87.4,15.4,36.4,9.1,35.7,98.397389,93.296096,22.5,36.2,37.8
9,"Santana, Carlos",467793,2020,255,16.9,18.4,0.316,0.372,88.0,12.2,32.3,6.7,36.6,99.260366,93.650599,20.6,36.3,26.2
