# Baseball Swing Coaching Analysis

As a Data Analyst for a professional baseball team, I was tasked with analyzing high-resolution swing data from five batters. The dataset included detailed time-series measurements of joint positions, angles, and angular velocities, together with bat metrics and the timing of key events such as ball release and contact.

The goal of this project was to process the raw data, identify key swing events, and calculate meaningful metrics that could inform hitting coaches about player performance. By translating complex biomechanical data into actionable insights, the analysis bridges the gap between data science and on-field coaching, helping coaches understand timing, power generation, and contact quality without requiring a technical background in biomechanics.

This project demonstrates the application of data processing, event detection, and metric extraction to support performance optimization in baseball.

In [2]:
import pandas as pd
import numpy as np

def load_data(file_path):
    data = pd.read_csv(file_path)
    return data

def clean_data(data):
    # Standardize column names
    data.columns = data.columns.str.strip().str.lower().str.replace(' ', '_')

    return data

In [3]:
file_paths = [
    '/Users/Carlos/Desktop/MyData/Blue Jays data/batter1.csv',
    '/Users/Carlos/Desktop/MyData/Blue Jays data/batter2.csv',
    '/Users/Carlos/Desktop/MyData/Blue Jays data/batter3.csv',
    '/Users/Carlos/Desktop/MyData/Blue Jays data/batter4.csv',
    '/Users/Carlos/Desktop/MyData/Blue Jays data/batter5.csv'
]

# Dictionary to store cleaned datasets
cleaned_datasets = {}

for file_path in file_paths:
    # Load the dataset
    data = load_data(file_path)
    
    # Clean the dataset
    cleaned_data = clean_data(data)
    
    # Save cleaned dataset to dictionary with the file name as the key
    file_name = file_path.split('/')[-1].split('.')[0]  # Extracts file name without extension
    cleaned_datasets[file_name] = cleaned_data

In [4]:
def event_rows(datasets_dict):
    """Return, per dataset, the rows for ball release, peak bat speed, and end of swing."""
    rows = {}

    for name, df in datasets_dict.items():
        if df.empty:
            continue

        # Rows selected for the current dataset (named so it does not shadow this function)
        selected = []

        # 1. Ball release (where time_x == ball_release_x[0])
        if 'ball_release_x' in df.columns:
            ball_release_time = df['ball_release_x'].iloc[0]  # Get the first value of ball_release_x
            selected.append(df[df['time_x'] == ball_release_time])

        # 2. Time of peak bat speed
        max_value_index = df['bat_speed_x'].idxmax()
        selected.append(df.loc[[max_value_index]])

        # 3. Slowest bat speed after peak bat speed, treated as the end of the swing
        data_after_max = df.loc[max_value_index:].iloc[1:]
        if not data_after_max.empty:  # empty when the peak is the last sample
            slowest_bat_speed_index = data_after_max['bat_speed_x'].idxmin()
            selected.append(df.loc[[slowest_bat_speed_index]])

        # Concatenate rows into a single DataFrame for the current dataset
        rows[name] = pd.concat(selected)

    return rows

event_data = event_rows(cleaned_datasets)  # run function on cleaned_datasets

In [5]:
def contact(row, dataset_columns):
    # Check if 'contact_x' exists in the dataset columns
    return 'Yes' if 'contact_x' in dataset_columns else 'No'

def sweet_spot_hit(row, tolerance=0.0508, dataset=None):
    # tolerance: 0.0508 m = 2 in, which assumes positions are in meters. The summary
    # table labels stride length in feet, so confirm the units before relying on this.
    #
    # NOTE: `row` is not used. Unless `dataset` is passed, the lookup uses the
    # module-level `data`, which after the loading loop is the last file read
    # (batter5.csv), so every batter is scored against that file.
    source = data if dataset is None else dataset

    # Check if 'contact_x' column exists
    if 'contact_x' not in source.columns:
        return 'N/A'

    # Get the contact time
    contact_time = source['contact_x'].iloc[0]

    # Filter the dataset for the time of contact
    contact_data = source[source['time_x'] == contact_time]
    if contact_data.empty:  # no sample falls exactly on the contact time
        return 'N/A'

    # Extract ball position and sweet spot position at contact
    ball_hit = contact_data[['ball_hit_x', 'ball_hit_y', 'ball_hit_z']].iloc[0].to_numpy(dtype=float)
    sweet_spot = contact_data[['sweet_spot_x', 'sweet_spot_y', 'sweet_spot_z']].iloc[0].to_numpy(dtype=float)

    # Calculate Euclidean distance between ball hit point and sweet spot
    distance = np.linalg.norm(ball_hit - sweet_spot)

    # Check if the distance is within the tolerance
    return 'Yes' if distance <= tolerance else 'No'

def stride_length(row):
    if not {'lankle_x', 'rankle_x', 'lankle_y', 'rankle_y', 'lankle_z', 'rankle_z'}.issubset(row.index):
        return np.nan
    return np.sqrt(
        (row['lankle_x'] - row['rankle_x']) ** 2 +
        (row['lankle_y'] - row['rankle_y']) ** 2 +
        (row['lankle_z'] - row['rankle_z']) ** 2)

def torso_pelvis_ratio(row):
    if not {'torso_angular_velocity_x', 'torso_angular_velocity_y', 'torso_angular_velocity_z',
            'pelvis_angular_velocity_x', 'pelvis_angular_velocity_y', 'pelvis_angular_velocity_z'}.issubset(row.index):
        return np.nan

    # Calculate torso angular velocity
    torso_angular_velocity = np.sqrt(
        row['torso_angular_velocity_x'] ** 2 +
        row['torso_angular_velocity_y'] ** 2 +
        row['torso_angular_velocity_z'] ** 2
    )

    # Calculate pelvis angular velocity
    pelvis_angular_velocity = np.sqrt(
        row['pelvis_angular_velocity_x'] ** 2 +
        row['pelvis_angular_velocity_y'] ** 2 +
        row['pelvis_angular_velocity_z'] ** 2
    )

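    # Return the torso-to-pelvis ratio (NaN when the pelvis is not rotating, to avoid dividing by zero)
    if pelvis_angular_velocity == 0:
        return np.nan
    return torso_angular_velocity / pelvis_angular_velocity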

In [6]:
def apply_metrics(rows, metric_functions):
    results = {}

    for dataset_name, event_rows in rows.items():
        # Create a copy of the event rows to add metric columns
        metrics_df = event_rows.copy()

        # Get the column names of the current dataset
        dataset_columns = metrics_df.columns

        # Apply each metric function to every row in the DataFrame
        for metric_name, metric_function in metric_functions.items():
            if metric_name == "contact":
                # Pass dataset columns to the 'contact' function
                metrics_df[metric_name] = metrics_df.apply(metric_function, axis=1, args=(dataset_columns,))
            else:
                metrics_df[metric_name] = metrics_df.apply(metric_function, axis=1)

        # Select only the time and calculated metric columns
        cleaned_df = metrics_df[["time_x"] + list(metric_functions.keys())]

        # Store the cleaned DataFrame for this dataset
        results[dataset_name] = cleaned_df

    return results

In [12]:
metric_functions = {
    "contact": contact,
    "sweet_spot_hit": sweet_spot_hit,
    "stride_length": stride_length,
    "torso_pelvis_ratio": torso_pelvis_ratio
}
final_data = apply_metrics(event_data, metric_functions)
final_data

{'batter1':       time_x contact sweet_spot_hit  stride_length  torso_pelvis_ratio
 602  2.00667     Yes             No       2.378457            1.680733
 729  2.43000     Yes             No       2.830392            3.902356
 768  2.56000     Yes             No       2.728628            1.925812,
 'batter2':       time_x contact sweet_spot_hit  stride_length  torso_pelvis_ratio
 595  1.98333      No             No       2.121699            0.576384
 722  2.40667      No             No       3.038724            7.048533
 758  2.52667      No             No       2.521989            1.715645,
 'batter3':       time_x contact sweet_spot_hit  stride_length  torso_pelvis_ratio
 595  1.98333      No             No       2.121699            0.576384
 722  2.40667      No             No       3.038724            7.048533
 758  2.52667      No             No       2.521989            1.715645,
 'batter4':       time_x contact sweet_spot_hit  stride_length  torso_pelvis_ratio
 599  1.99667    

In [14]:
def generate_table(results):
    formatted_results = []

    for batter, metrics_df in results.items():
        for i, (_, row) in enumerate(metrics_df.iterrows()):
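            # Identify the event from row order in metrics_df. This assumes all three
            # events were found, in this order; a dataset without ball_release_x
            # would have its rows mislabeled.
            event = ['Ball Release', 'Max Bat Speed', 'End of Swing'][i]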

            # Append formatted row to results
            formatted_results.append({
                "Batter": batter,
                "Event": event,
                "Time (s)": row["time_x"],
                "Contact": row.get("contact", "N/A"),
                "Sweet Spot Hit": row.get("sweet_spot_hit", "N/A"),
                "Stride Length (ft)": row.get("stride_length", "N/A"),
                "Torso-Pelvis Ratio": row.get("torso_pelvis_ratio", "N/A")
            })

    # Convert the list of results into a DataFrame
    return pd.DataFrame(formatted_results)

Final_table = generate_table(final_data)

Final_table

|   | Batter | Event | Time (s) | Contact | Sweet Spot Hit | Stride Length (ft) | Torso-Pelvis Ratio |
|---|---|---|---|---|---|---|---|
| 0 | batter1 | Ball Release | 2.00667 | Yes | No | 2.378457 | 1.680733 |
| 1 | batter1 | Max Bat Speed | 2.43 | Yes | No | 2.830392 | 3.902356 |
| 2 | batter1 | End of Swing | 2.56 | Yes | No | 2.728628 | 1.925812 |
| 3 | batter2 | Ball Release | 1.98333 | No | No | 2.121699 | 0.576384 |
| 4 | batter2 | Max Bat Speed | 2.40667 | No | No | 3.038724 | 7.048533 |
| 5 | batter2 | End of Swing | 2.52667 | No | No | 2.521989 | 1.715645 |
| 6 | batter3 | Ball Release | 1.98333 | No | No | 2.121699 | 0.576384 |
| 7 | batter3 | Max Bat Speed | 2.40667 | No | No | 3.038724 | 7.048533 |
| 8 | batter3 | End of Swing | 2.52667 | No | No | 2.521989 | 1.715645 |
| 9 | batter4 | Ball Release | 1.99667 | Yes | No | 2.342277 | 0.59788 |


# Events and Metrics

## Events Chosen

1. **Ball Release:** This event serves as the starting point for analyzing the swing. A well-timed swing often begins in response to the pitcher’s release, so this moment is critical for understanding how the batter initiates their timing. Recording stride length and torso-pelvis ratio at release gives coaches a baseline to compare against the next two events and to spot trends or inconsistencies.

2. **Max Bat Speed:** This event captures the peak bat speed during the swing, which reflects the batter’s ability to generate power. Higher bat speed raises the potential for hard contact with the ball, a crucial component of effective hitting.

3. **End of Swing:** This event is defined as the point of slowest bat speed after the peak, and it provides insight into how effectively the batter follows through. A strong follow-through is associated with better momentum transfer through contact, while a weak follow-through could indicate energy loss or mechanical inefficiencies.
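
A minimal sketch of how these three events could be located in a single batter's DataFrame is shown below. It assumes the cleaned column names `time_x`, `bat_speed_x`, and `ball_release_x`; the function name `find_swing_events` and the synthetic data are hypothetical and only for illustration. Unlike an exact time match, it picks the sample nearest to the release time and keys each result by event name, so a missing event cannot shift the labels of the others.

```python
import numpy as np
import pandas as pd

def find_swing_events(df, time_col="time_x", speed_col="bat_speed_x",
                      release_col="ball_release_x"):
    """Return {event name: row label}, skipping events that cannot be found."""
    events = {}

    # Ball release: sample nearest to the recorded release time
    if release_col in df.columns:
        release_time = df[release_col].iloc[0]
        events["Ball Release"] = (df[time_col] - release_time).abs().idxmin()

    # Max bat speed: row with the highest bat speed
    peak = df[speed_col].idxmax()
    events["Max Bat Speed"] = peak

    # End of swing: slowest bat speed strictly after the peak
    after_peak = df.loc[peak:, speed_col].iloc[1:]
    if not after_peak.empty:
        events["End of Swing"] = after_peak.idxmin()

    return events

# Synthetic swing (made-up numbers): speed rises to a peak, then drops
demo = pd.DataFrame({
    "time_x": np.arange(10) / 10,
    "bat_speed_x": [1.0, 2.0, 3.0, 5.0, 9.0, 20.0, 30.0, 12.0, 4.0, 6.0],
    "ball_release_x": 0.2,
})

events = find_swing_events(demo)
event_table = demo.loc[list(events.values())].assign(event=list(events.keys()))
print(event_table[["event", "time_x", "bat_speed_x"]])
```

Carrying the event name alongside each row, rather than inferring it from row order, keeps the labels correct for datasets where one of the events is unavailable.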

## Metrics Chosen

1. **Contact:** This metric records whether the batter made contact with the ball; in the code above, it is "Yes" when the dataset contains a `contact_x` column. It is fundamental for assessing the batter’s ability to align their swing mechanics with the pitch. For a coach, knowing whether contact was made can reveal timing or swing path issues.

2. **Sweet Spot Hit:** Not all contact is created equal, so this metric identifies whether the ball was hit on the "sweet spot" of the bat, defined here as the ball hit point lying within a 2-inch tolerance of the sweet spot position at the time of contact. Hitting the sweet spot is essential for maximizing distance and power, and this metric helps coaches determine whether the batter is making optimal contact or needs to adjust their swing.

3. **Stride Length:** This metric measures the distance between the batter’s feet at each event, computed as the straight-line distance between the left and right ankle positions. A proper stride length contributes to balance, stability, and power generation. A coach can use this metric to determine if the batter is overstriding or understriding, which could lead to loss of power or balance.

4. **Torso-Pelvis Ratio:** This metric evaluates the coordination between the upper body (torso) and lower body (pelvis) during the swing. It is the magnitude of the torso angular velocity divided by the magnitude of the pelvis angular velocity. A higher ratio indicates effective rotational mechanics, which are key for generating bat speed and power. Ideally, batters should show a higher ratio at max bat speed, with lower ratios at ball release and the end of the swing to reflect efficient energy transfer and stability.
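
The three numeric metrics can also be computed for every sample at once instead of row by row. The sketch below is one possible vectorized version, assuming the cleaned column names used in this notebook; the helper names (`magnitude`, `stride_length_series`, `torso_pelvis_ratio_series`, `sweet_spot_distance`) and the demo values are hypothetical. Because the contact lookup needs the full time series rather than a single event row, `sweet_spot_distance` takes the batter's DataFrame as an argument.

```python
import numpy as np
import pandas as pd

def magnitude(df, prefix):
    """Row-wise Euclidean norm of the <prefix>_x, <prefix>_y, <prefix>_z columns."""
    cols = [f"{prefix}_{axis}" for axis in "xyz"]
    return np.sqrt((df[cols] ** 2).sum(axis=1))

def stride_length_series(df):
    """3D distance between the left and right ankle positions at every sample."""
    left = df[["lankle_x", "lankle_y", "lankle_z"]].to_numpy()
    right = df[["rankle_x", "rankle_y", "rankle_z"]].to_numpy()
    return pd.Series(np.linalg.norm(left - right, axis=1), index=df.index)

def torso_pelvis_ratio_series(df):
    """Torso / pelvis angular velocity magnitude; NaN where the pelvis magnitude is zero."""
    pelvis = magnitude(df, "pelvis_angular_velocity").replace(0, np.nan)
    return magnitude(df, "torso_angular_velocity") / pelvis

def sweet_spot_distance(df):
    """Ball-to-sweet-spot distance at the sample nearest contact; NaN without contact_x."""
    if "contact_x" not in df.columns:
        return np.nan
    nearest = (df["time_x"] - df["contact_x"].iloc[0]).abs().idxmin()
    row = df.loc[nearest]
    ball = row[["ball_hit_x", "ball_hit_y", "ball_hit_z"]].to_numpy(dtype=float)
    spot = row[["sweet_spot_x", "sweet_spot_y", "sweet_spot_z"]].to_numpy(dtype=float)
    return float(np.linalg.norm(ball - spot))

# Two made-up samples; contact happens at the second one (t = 0.5)
demo = pd.DataFrame({
    "time_x": [0.0, 0.5], "contact_x": [0.5, 0.5],
    "lankle_x": [0.0, 0.0], "lankle_y": [0.0, 0.0], "lankle_z": [0.0, 0.0],
    "rankle_x": [3.0, 1.0], "rankle_y": [4.0, 2.0], "rankle_z": [0.0, 2.0],
    "torso_angular_velocity_x": [6.0, 0.0],
    "torso_angular_velocity_y": [0.0, 0.0],
    "torso_angular_velocity_z": [8.0, 5.0],
    "pelvis_angular_velocity_x": [0.0, 0.0],
    "pelvis_angular_velocity_y": [5.0, 0.0],
    "pelvis_angular_velocity_z": [0.0, 0.0],
    "ball_hit_x": [1.0, 1.0], "ball_hit_y": [1.0, 1.0], "ball_hit_z": [1.0, 1.0],
    "sweet_spot_x": [0.0, 1.0], "sweet_spot_y": [0.0, 1.0], "sweet_spot_z": [0.0, 1.03],
})

stride = stride_length_series(demo)
ratio = torso_pelvis_ratio_series(demo)
distance = sweet_spot_distance(demo)
# The tolerance must be in the same unit as the position columns (0.0508 m = 2 in)
hit = "Yes" if distance <= 0.0508 else "No"
print(stride.tolist(), ratio.tolist(), distance, hit)
```

Returning the distance itself, and applying the tolerance afterwards, also lets a coach see how far off the sweet spot a given contact was rather than only a yes/no flag.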

## Conclusion

This analysis demonstrates how high-resolution swing data can be translated into actionable insights for coaching. By focusing on key events, namely ball release, maximum bat speed, and the end of the swing, and evaluating metrics such as contact, sweet spot hits, stride length, and torso-pelvis ratio, we can identify trends in timing, power generation, and follow-through. These insights allow coaches to pinpoint mechanical inefficiencies, optimize swing mechanics, and support player development without requiring a technical background in biomechanics. Overall, this approach bridges data science and on-field performance, highlighting the value of quantitative analysis in improving hitting outcomes.