# Project Summary: Predicting Pass vs. Run Plays Using Pre-Snap Data

Metric Track | Author: 
[Max Fishman](https://www.linkedin.com/in/max-fishman-7a0847b3/) <br>

# Introduction

**How does a defense prepare for a play, whether in real time or in practice? How do defensive coordinators decide when to deploy man coverage versus zone? How do defenders adapt their approach based on the offense’s lineup?**  

Traditionally, these answers come from hours of film study, uncovering trends in an opponent's offensive strategy. But what if we could streamline this process? What if, instead of exhaustive manual analysis, we could rely on a data-driven model to predict play outcomes with precision?  

This project explores that possibility: using pre-snap features to accurately predict whether a play will be a run or a pass. By analyzing an opposing offense’s pre-snap lineup and tendencies, this model has the potential to revolutionize defensive preparation, enabling teams to anticipate plays and refine their strategy more effectively. Imagine the tactical advantage this could offer—reducing film study time while enhancing practice efficiency and in-game adjustments.  

This predictive approach was developed through three key steps:  
1. **Data Processing** – Aggregating and preparing pre-snap data from multiple sources.  
2. **Feature Engineering** – Designing and selecting meaningful metrics such as motion types, defensive alignments, and field context.  
3. **Model Training** – Building and optimizing a logistic regression model to predict play type based on the engineered features.  

This framework not only demonstrates the feasibility of play prediction but also highlights the untapped potential of data science in football strategy.


# I. Data Processing

### a. Manipulating and Aggregating Meaningful Data
The foundation of this project begins with clearly defining and sorting the dataset to include only run and pass plays. By isolating these play types, I ensure that the focus remains on the core task: predicting whether a play is a run or pass. 

#### Classifying Run and Pass Plays
Each play is labeled as a run or a pass according to the following rule:

$$
\text{playType} = 
\begin{cases} 
\text{"run"(0)} & \text{if } \text{hadRushAttempt} = 1 \\
\text{"pass"(1)} & \text{if } \text{passResult} \in \{ 'C',  'I',  'IN' \} 
\end{cases}
$$

A play is labeled a pass when `passResult` is a completion (**C**), an incompletion (**I**), or an interception (**IN**); sacks and scrambles are omitted. A play is labeled a run when `hadRushAttempt` equals 1. Once this filtering is complete, I merge relevant features from the provided datasets, such as tracking data, play-by-play information, and game context. This classification defines both the target variable and the set of plays used in every later stage of the analysis.
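The labeling rule above can be sketched in a few lines. This is a minimal illustration in plain Python, not the project's pipeline code: the function name `classify_play`, the dictionary row layout, and the sample rows are assumptions made for the example, and the run condition is checked first simply because it is listed first in the rule.

```python
from typing import Optional

# passResult codes that count as a pass: completion, incompletion, interception
PASS_RESULTS = {"C", "I", "IN"}


def classify_play(had_rush_attempt: int, pass_result: Optional[str]) -> Optional[int]:
    """Return 0 for a run, 1 for a pass, or None when the play is excluded."""
    if had_rush_attempt == 1:
        return 0
    if pass_result in PASS_RESULTS:
        return 1
    return None  # anything else (for example a sack) is dropped from the dataset


# Hypothetical sample rows; "S" stands in for any non-matching passResult value.
plays = [
    {"playId": 1, "hadRushAttempt": 1, "passResult": None},
    {"playId": 2, "hadRushAttempt": 0, "passResult": "C"},
    {"playId": 3, "hadRushAttempt": 0, "passResult": "S"},
    {"playId": 4, "hadRushAttempt": 0, "passResult": "IN"},
]

labeled = []
for play in plays:
    label = classify_play(play["hadRushAttempt"], play["passResult"])
    if label is not None:  # keep only run and pass plays
        labeled.append({**play, "playType": label})

print([(p["playId"], p["playType"]) for p in labeled])
```

Returning `None` for excluded plays keeps the filtering step explicit, so the same function can both label and filter.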


To further enhance the dataset, I incorporate additional metrics from my Formal Play Score (FPS) dataframe, which I developed to quantify situational factors such as field position and game score impact. I'll discuss this logic and other feature engineering tactics in the next section. At this stage, however, we have all the relevant play-level data merged together from the provided datasets.

### b. Designing Data Pipeline 
The data pipeline combines, cleans, and transforms various datasets to create structured inputs for model training and testing. Below are the major steps:

#### Data Loading and Filtering:
- Run and pass plays are identified using `aggregate_play_types` by merging key datasets like `plays.csv`, `games.csv`, and `player_play.csv`.
- Play-level metrics are aggregated and enriched with FPS data to include contextual features like field position and score impact.

#### Tracking Data Processing:
For each week (1–9):
- Pre-snap tracking data is filtered and cleaned to standardize player positions and orientations. Invalid entries are removed, and memory usage is optimized using downcasting.
- **Defensive Features**: Calculations include the number of defenders in the box, primary matchups, and mismatches.
- **Offensive Features**: Metrics such as pre-snap motion categories, motion player count, and time-to-snap are derived.

#### Merging and Transformation:
- Weekly defensive and offensive features are combined with aggregated play-level data using `final_merge`.
- Categorical features (e.g., run/pass, formations) are encoded into dictionaries for model compatibility.

#### Training and Testing Dataset Creation:
- Data from weeks 1–8 is shuffled and labeled as the training set, while week 9 forms the testing set.
- Final outputs include `training_data`, `testing_data`, and their corresponding targets (`train_labels`, `test_labels`), saved as CSV files.
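
The week-based split described above can be sketched as follows. This is a minimal illustration in plain Python rather than the project's actual pipeline: the function name `split_by_week`, the row layout, the synthetic rows, and the fixed random seed are all assumptions made for the example.

```python
import random


def split_by_week(rows, train_weeks=range(1, 9), test_week=9, seed=42):
    """Split play rows into a shuffled training set and a held-out test week."""
    train = [r for r in rows if r["week"] in train_weeks]
    test = [r for r in rows if r["week"] == test_week]
    random.Random(seed).shuffle(train)  # shuffle only the training rows
    return train, test


# Hypothetical play rows: three plays per week for weeks 1 through 9.
rows = [
    {"week": week, "playId": play_id, "play_type": play_id % 2}
    for week in range(1, 10)
    for play_id in range(3)
]

training_data, testing_data = split_by_week(rows)
train_labels = [r["play_type"] for r in training_data]
test_labels = [r["play_type"] for r in testing_data]

print(len(training_data), len(testing_data))
```

Holding out the final week, rather than sampling test plays at random, keeps every test play later in time than the training plays.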


# II. Feature Engineering

### a. Formal Play Score Weights
The Formal Play Score (FPS) was designed to quantify the impact of a play on the game by integrating field position, game score scenario, and play success metrics. This process involved significant data manipulation, resulting in a large feature-rich play-level dataset. Although the score did not directly improve the model’s predictive accuracy, other engineered features—such as `fieldPositionWeight`, `scoreDifferential`, and `gameQuarterWeight`—proved valuable, making the FPS dataframe a crucial part of the data pipeline. For more detailed information on the logic behind FPS and its components, refer to [Section A. Aggregating Play-Level Insights (FPS)](#a-aggregating-play-level-insights-fps) in the Appendix.


### b. Tracking Features 
These features were designed to enrich the dataset by capturing insights into player movement, motion types, and field positioning during pre-snap frames. Kinematic data can be particularly valuable for uncovering real-time play-level insights. The tracking features were specifically engineered for both **offense** and **defense**. 

##### Offensive Features
- **Players in motion pre-snap** (count)
- **Motion categorization** (6 unique classes)
- **Pre-snap time duration** (time)
  
The development of these offensive motion metrics centered on identifying frames where an eligible offensive player was in motion before the snap. This involved setting thresholds on speed (**s**) and field position (**x, y**) to capture events like `breaking-from-huddle`, `lined-up`, and `in-motion`. Essentially, we used local maxima and minima in the tracking data: frames where player speed exceeded a defined threshold were tagged as `in-motion`. From this, we could determine the number of players in motion and categorize the type of motion (e.g., No Motion, Single Player in Motion, Multi-Player Shift) more precisely than the motion flags provided in the data.
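
A simplified sketch of the speed-threshold tagging is shown here. It is an illustration only: the threshold of 0.5 yards per second, the minimum of 3 consecutive slow frames for a `lined-up` state, and the function name `tag_frames` are assumed values rather than the ones used in the project, and the position-based `breaking-from-huddle` state is left out.

```python
from itertools import groupby


def tag_frames(speeds, speed_threshold=0.5, min_set_frames=3):
    """Tag each pre-snap frame as 'in-motion' or 'lined-up' from player speed.

    A run of slow frames counts as 'lined-up' only if it lasts at least
    min_set_frames; shorter slow runs are treated as part of the motion.
    """
    moving = [s > speed_threshold for s in speeds]
    labels = []
    for is_moving, run in groupby(moving):
        run_length = len(list(run))
        if is_moving or run_length < min_set_frames:
            labels.extend(["in-motion"] * run_length)
        else:
            labels.extend(["lined-up"] * run_length)
    return labels


# Hypothetical speed trace (yards/second) for one player, ending at the snap.
speeds = [2.1, 1.8, 0.3, 0.1, 0.0, 0.1, 1.2, 3.4, 4.0]
labels = tag_frames(speeds)
in_motion_at_snap = labels[-1] == "in-motion"

print(labels)
print(in_motion_at_snap)
```

Requiring several consecutive slow frames before declaring a player set avoids mislabeling a brief change of direction as a full stop.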

The example below (*Figure 1.*) demonstrates how motion detection works for Philadelphia Eagles WR Quez Watkins. From analyzing the play in my animation program (*Figure 2.*), we can see Watkins breaks from the huddle to the right side of the offensive line, then motions to the left (with a defender following). He comes to a stop behind AJ Brown (classifying a `lined-up` state, with **s** dropping below the threshold for multiple frames) before motioning back to the right side (defender passes assignment) just as the ball is snapped.




<div style="display: flex; align-items: center; justify-content: center;">
    <div style="margin-right: 20px;">
        <img src="https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/PHI%20Speed%20Plot%20Presnap/quez_watkins_motion.gif?raw=true" 
             style="width: 600px; border: 1px solid #ddd; border-radius: 8px;"/>
        <p style="text-align: center; font-weight: bold;">Quez Watkins Speed Plot</p>
        
<center>Figure 1. Motion Frames for Quez Watkins (gameId: 2022091104, playId: 713).</center>
<br>
    </div>
    <div>
        <img src="https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/Animation%20Gifs/presnap_animation_play713_game2022091104.gif?raw=true" 
             style="width: 800px; border: 1px solid #ddd; border-radius: 8px;"/>
        <p style="text-align: center; font-weight: bold;">Presnap Animation GIF</p>
        
<center>Figure 2. Pre-snap Animation for play: (gameId: 2022091104, playId: 713).</center>
<br>
    </div>
</div>





The `player_plays` dataframe only sets `inMotionAtBallSnap` and `motionSinceLineSet` to `TRUE` for Watkins on this play. With my approach, we can classify the play as **single-combined**, indicating that a single player shifted pre-snap AND was in motion at the ball snap. Additionally, this method allows for the detection of more specific motion types, such as jet motion, orbit motion, and yo-yo motion (likely the case here with Watkins). 

On another play, DK Metcalf is tagged with `motionSinceLineSet` equal to `TRUE`, but our logic determined that he was not a valid candidate for pre-snap motion under the given criteria. For a deeper explanation of this methodology and its advantages in similar scenarios, refer to [Section B. Offensive Tracking Motion Features](#b-Motion-Detection-and-Categorization) in the Appendix.



##### Defensive Features
- **Players in the box** (count)
- **Defensive assignment mismatches** (count)
  
The defensive features focus on evaluating player positioning, spatial awareness, and coverage effectiveness. These tools analyze defensive player movements, their proximity to critical areas on the field, and potential mismatches in coverage.

The key objectives are to:

1. **Analyze Spatial Coverage**: Define important areas around the ball and identify which defenders are present within these zones during a play pre-snap.
2. **Evaluate Defensive Mismatches**: Detect situations where defenders are at a disadvantage based on their assigned matchups, such as linebackers covering faster wide receivers or running backs.
3. **Map Coverage Assignments**: Link defenders to their assigned offensive players to understand individual matchups and overall coverage strategies.

By combining tracking data with positional information, these features provide insights into how defensive players are distributed, their coverage effectiveness, and areas where the offense may exploit mismatches. This analysis is valuable for understanding defensive schemes and optimizing player assignments.
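
As an illustration of the spatial-coverage idea, the sketch below counts defenders inside a rectangular box in front of the ball. The box dimensions (5 yards deep, 4 yards to either side of the ball), the function name `count_defenders_in_box`, and the assumption that coordinates are standardized so the offense moves toward increasing `x` are all hypothetical choices for this example, not the definitions used in the project.

```python
def count_defenders_in_box(defenders, ball_x, ball_y, depth=5.0, half_width=4.0):
    """Count defenders within an assumed rectangular box beyond the ball.

    Assumes the offense moves toward increasing x, so defenders line up
    at x values greater than the ball's x coordinate.
    """
    return sum(
        1
        for d in defenders
        if 0.0 <= d["x"] - ball_x <= depth and abs(d["y"] - ball_y) <= half_width
    )


# Hypothetical pre-snap defender positions (yards) for a single frame.
defenders = [
    {"x": 31.0, "y": 26.0},  # close to the line, near the ball
    {"x": 33.5, "y": 29.0},  # second level, still inside the box
    {"x": 40.0, "y": 26.0},  # deep safety, outside the box
    {"x": 32.0, "y": 35.0},  # wide corner, outside the box
]

print(count_defenders_in_box(defenders, ball_x=30.0, ball_y=26.65))
```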


### c. Feature Selection
The feature selection process began after creating the training dataframes through the data pipeline, resulting in an initial set of 27 features. From there, a thorough selection process was conducted, leveraging a correlation heatmap to assess relationships with the target variable (`play_type`) and evaluating Variance Inflation Factors (VIF) to minimize multicollinearity among features.

#### Model Features

We selected 13 features as model inputs:
1. Offensive Formation
2. Receiver Alignment
3. Motion Categorization
4. Player Motion at Ball Snap
5. Pre-snap Time Duration
6. Defensive Mismatches
7. Defensive Players in the Box Pre-snap
8. Score Differential
9. Game Quarter Weight
10. Current Down
11. Yards to Go for a 1st Down
12. Field Position Weight
13. Absolute Yard Line


The 13 features selected as inputs to the model fall into three groups. **Kinematic features** (3-7), derived from tracking data, capture player movements and spatial interactions that are critical for understanding pre-snap intent, player motion and responsibilities, and potential mismatches. **Game and play scenario metrics** (8-13) contextualize each play within the broader game situation, reflecting decision-making pressure and strategies based on the score, down, and distance. Finally, **formation and alignment features** (1-2), Offensive Formation and Receiver Alignment, provide insight into team strategies and how personnel are positioned to execute plays, which can influence defensive reactions. Together, these features balance player-specific, situational, and spatial aspects, enhancing the model's ability to capture the nuances of football plays.  


# III. Model Training

### a. Model Selection
To identify the best-performing model for predicting play type (pass vs. run), we evaluated two approaches: Logistic Regression (LogReg) and Random Forest (RF). Both models demonstrated strong performance, as shown in their classification reports and confusion matrices. GridSearchCV was used to tune hyperparameters for both models, ensuring optimal configurations. Logistic Regression's simplicity and interpretability were advantageous, particularly with its ability to assign feature importance through coefficients. On the other hand, Random Forest, a more complex ensemble model, provided robust predictions by leveraging decision trees and feature interactions.

##### Training Model Inputs:
- **RUN (0):** 8,728
- **PASS (1):** 6,788
- **Features:** 13


##### Training Model Results:
- **Logistic Regression (LR)** 
    - Accuracy: **88.6%**
    - ROC-AUC: **92.4%**
        - **RUN** (0):
            - Precision: **83%** | Recall: **94%** | F1 Score: **0.88** | Support: **1230**
        - **PASS** (1):
            - Precision: **94%** | Recall: **85%** | F1 Score: **0.89** | Support: **1578**
              
- **Random Forest (RF)**
    - Accuracy: **89.4%**
    - ROC-AUC: **93.4%**
        - **RUN** (0):
            - Precision: **84%** | Recall: **93%** | F1 Score: **0.88** | Support: **1230**
              
        - **PASS** (1):
            - Precision: **94%** | Recall: **87%** | F1 Score: **0.90** | Support: **1578**
              
![Logistic Regression vs Random Forest Confusion Matrix](https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/Model%20Results/LRvsRF_hyptertuned_confusion_matrix.png)
<center>Figure 3. Logistic Regression vs Random Forest Hypertuned Confusion Matrix Comparison.</center>
<br>

The confusion matrix comparison (*Figure 3.*) illustrates that both models achieve high accuracy, correctly predicting run and pass plays in the high 80% range. Logistic Regression (LR) and Random Forest (RF) both exhibit strong performance, with LR achieving an accuracy of **88.6%** and RF slightly higher at **89.4%**. The ROC-AUC scores reflect similar trends, with LR scoring **92.4%** and RF slightly surpassing it with **93.4%**.
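
For readers less familiar with these metrics, the sketch below shows how accuracy, precision, recall, and F1 follow from the four cells of a binary confusion matrix, treating pass (1) as the positive class. The counts are made up for illustration and are not the project's results; the function name `classification_metrics` is likewise an assumption.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of predicted passes, how many were passes
    recall = tp / (tp + fn)     # of actual passes, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts: 100 actual passes and 100 actual runs.
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, precision exceeds recall for the pass class, the same pattern reported for both models above.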

Ultimately, Logistic Regression outperformed Random Forest on the testing data, achieving a slightly better accuracy and ROC-AUC score. While Random Forest excelled in scenarios with complex decision boundaries, Logistic Regression's linear nature aligned better with our dataset, offering consistent predictions without overfitting. This made Logistic Regression the preferred model, as it provided both high accuracy and transparency for understanding the impact of each feature on the model's predictions. Additionally, its lower computational complexity ensured efficient training and scalability for future applications.


### b. Model Performance
The Logistic Regression model achieved remarkable performance in predicting play types using pre-snap features. With an accuracy of **88.6%** and a ROC-AUC score of **92.4%**, it successfully captured the nuances of pre-snap data to distinguish between run and pass plays. The model's precision and recall values highlight its strength, especially in predicting pass plays, where it reached a precision of **94%** and a recall of **85%**. 

Comparing this to the Random Forest model, Logistic Regression provided slightly more consistent results while maintaining interpretability, an essential feature for practical implementation in coaching strategies. The confusion matrix analysis further corroborates its reliability, with minimal misclassification of runs as passes or vice versa.

Despite its strong performance, there is room for refinement. For instance, improving recall for pass predictions (85%) could enhance the model's reliability in high-pressure game situations where passes are more frequent.

<img src="https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/Model%20Results/test_data_confusion_matrix_logreg.png" width="400"/>
<center>Figure 4. Pass vs Run Model Performance on Testing Data, showing the classification report (left) and confusion matrix (right).</center>
<br>

# IV. Model Analysis
##### Team Predictability by Down
This heatmap illustrates how predictable each NFL team's play-calling is across downs, measured as the absolute deviation of the model's predicted probability from 0.5 (where 0.5 corresponds to a coin flip). Each line represents a team, showing how its predictability varies from 1st to 4th down. Higher values indicate more predictable decisions, while lower values suggest less predictable play-calling. Noticeable trends include generally lower predictability on 1st down, with a significant increase on 4th down for most teams, likely reflecting conservative or situation-driven decisions (e.g., punts or field goals). Variability between teams highlights differences in strategic approaches across downs.

<img src="https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/Model%20Results/team_predictability_by_down.png" width="400"/>
<center>Figure 5. Play Type Predictability by Down per Team.</center>
<br>
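
The predictability measure can be sketched as below. Averaging the absolute deviation within each team and down is an assumption about how the values were aggregated; the function name `predictability_by_team_and_down` and the sample probabilities are also hypothetical.

```python
from collections import defaultdict


def predictability_by_team_and_down(predictions):
    """Average |p - 0.5| per (team, down), where p is the predicted pass probability."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for team, down, p_pass in predictions:
        key = (team, down)
        totals[key] += abs(p_pass - 0.5)
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical (team, down, predicted pass probability) tuples.
predictions = [
    ("PHI", 1, 0.9),
    ("PHI", 1, 0.7),
    ("PHI", 3, 0.55),
    ("SEA", 1, 0.5),
]

scores = predictability_by_team_and_down(predictions)
for (team, down), score in sorted(scores.items()):
    print(team, down, round(score, 3))
```

A score of 0 means the model sees the play as a coin flip, while the maximum of 0.5 means the model is certain of the play type.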

# Conclusion:
This project demonstrated the potential of using pre-snap data to predict run and pass plays, offering a data-driven solution to aid defensive preparation in football. By focusing on 13 meticulously engineered features, the model provides insights into offensive tendencies, enabling teams to optimize their in-game strategies.

Logistic Regression emerged as the preferred model due to its accuracy, interpretability, and computational efficiency. It successfully translated pre-snap patterns into actionable insights, showcasing the role of data science in sports strategy. With an **88.6% accuracy rate**, the model not only highlights the feasibility of play prediction but also lays the groundwork for more sophisticated applications in the future.

However, this project also underscores the challenges of working with noisy and highly contextual sports data. The lessons learned here provide a foundation for future iterations, aiming to refine predictions and expand the scope of analysis.

---

# Future Improvements:
While this model shows promise, several avenues exist for further development:

1. **Expanding Feature Set**:
   Incorporating more advanced tracking data, such as individual player acceleration and deceleration rates, could reveal deeper insights into play dynamics. Features like offensive line spacing or quarterback tendencies pre-snap may also improve prediction accuracy.

2. **Exploring Advanced Models**:
   While Logistic Regression performed well, future iterations could explore deep learning models such as Recurrent Neural Networks (RNNs) or Transformer-based architectures to capture temporal dependencies in pre-snap movements.

3. **Model Generalization**:
   The current model is trained on a limited dataset (Weeks 1–8 for training, Week 9 for testing). Expanding the dataset to include additional weeks or seasons could improve its robustness and generalizability across teams and game situations.

4. **Game Situation Context**:
   Incorporating additional contextual factors like weather conditions, player injuries, or coaching tendencies could add another layer of depth to the predictions.

5. **Real-Time Predictions**:
   Developing a real-time prediction system that analyzes pre-snap data during games could provide actionable insights to coaches, offering a tangible impact on game-day strategy.

By addressing these improvements, this project can evolve into a powerful tool for football teams, blending data science with game strategy to stay ahead of the competition.

---

# Acknowledgment

I want to express my heartfelt gratitude to the competition hosts for organizing such an incredible event. Specifically, I’d like to thank **Thompson Bliss** for being so responsive on the discussion board and for taking the time to answer my questions—it made a huge difference throughout the process. I also want to acknowledge **Michael Lopez** for all he does to make this competition such a valuable and enriching experience for everyone involved. Thank you to Ally Blake, Paul Mooney, and Addison Howard as well!

Participating in the NFL Big Data Bowl 2025 has been an incredible experience for me personally. This project was an opportunity to prove to myself that I could produce something insightful and take a step closer to becoming a sports data analyst one day. It fast-tracked my data science skills, expanded my learning experience, and deepened my understanding of a game I watch every Sunday and truly love.

Looking back, while my achievements in this project may seem modest, my submission was the result of several months of effort, countless late nights, and an immense amount of dedication. I am proud of my results, and I applaud all the contestants for another fantastic year of innovation and analysis.  

Thank you again to the competition hosts for this amazing opportunity—I look forward to seeing how the competition evolves in the future!


---
**Submitted**: 01/06/2025   
**Code:** [GitHub Repo](https://github.com/TechMax14/BIG-DATA-BOWL-25)


# Appendix

## A. Aggregating Play-Level Insights (FPS)
To begin, my focus was on creating a robust play score by aggregating various metrics into play-level insights. The goal was to combine data on field position, game score, game scenario, and play success to generate a `formal play score` (FPS). This score aimed to quantify the impact of each play based on these contextual factors, allowing for a more holistic understanding of its relevance.

Initially, I intended to correlate pre-snap motion patterns with play success, hypothesizing that analyzing motion could reveal valuable insights into offensive strategies and game impact. However, the correlation did not yield the expected results in terms of direct impact on predictive accuracy. Despite this, the FPS data still provided a solid foundation for the dataset, contributing to the overall feature engineering process. While it didn’t directly boost the model’s performance, the play score helped in shaping and refining the features used for training, ultimately playing a supporting role in improving the model’s effectiveness.

Let's dive into this process:

### Field Position Weights

#### Field Position Categories:

The field position categories are assigned based on the `yardlineNumber` and `yardlineSide`. Categories like **Neutral**, **RedZone**, **FieldGoalRange**, etc., are used to evaluate how favorable the field position is for the offense. These categories are as follows:

- **Heaven**: The opponent's 1-yard line.
- **GoldZone**: Between the opponent's 2- and 10-yard lines.
- **RedZone**: Between 10 and 20 yards from the opponent's end zone.
- **FieldGoalRange**: Between 20 and 45 yards from the opponent’s end zone.
- **Decent**: Between the opponent's 45 and your own 45-yard line, solid field position.
- **Neutral**: Between your own 25 and 45-yard line.
- **Bad**: Between your own 10 and 25-yard line.
- **Horrible**: Between the 2- and 10-yard lines on your own side of the field.
- **Hell**: On your own 1-yard line.

The categories are mapped as:

$$
\text{field\_position} = \text{Category from list: } \{ \text{Heaven}, \text{GoldZone}, \text{RedZone}, \text{FieldGoalRange}, \text{Decent}, \text{Neutral}, \text{Bad}, \text{Horrible}, \text{Hell} \}
$$

#### Field Position Weight Calculation:

Each field position category is assigned a weight based on how favorable it is for the offense. A higher weight indicates better field position, which means a higher likelihood of scoring. The following formula is used to calculate the field position weight for each category:

$$
\text{field\_position\_weight} =
\begin{cases} 
2.5 + \text{expectedPoints} \times 0.3 & \text{if position = Heaven} \\
2.2 + \text{expectedPoints} \times 0.2 & \text{if position = GoldZone} \\
2.0 + \text{expectedPoints} \times 0.2 & \text{if position = RedZone} \\
1.7 + \text{expectedPoints} \times 0.1 & \text{if position = FieldGoalRange} \\
1.2 & \text{if position = Decent} \\
1.0 & \text{if position = Neutral} \\
0.8 & \text{if position = Bad} \\
0.5 & \text{if position = Horrible} \\
0.2 & \text{if position = Hell} \\
\end{cases}
$$

#### Breakdown of Field Position Weights:

- **Heaven (Weight = $2.5 + \text{expectedPoints} \times 0.3$)**:
  - This is the best field position, as it places the offense just one yard away from the opponent's end zone. The weight is the highest, reflecting the offense's great scoring opportunity.

- **GoldZone (Weight = $2.2 + \text{expectedPoints} \times 0.2$)**:
  - The offense is within striking distance, between the opponent's 2-yard line and 10-yard line (often "first and goal" scenarios). This is a strong field position, but slightly less favorable than "Heaven."

- **RedZone (Weight = $2.0 + \text{expectedPoints} \times 0.2$)**:
  - The offense is between the opponent's 10-yard line and 20-yard line. This is still a good scoring opportunity, but slightly harder than the previous two categories.

- **FieldGoalRange (Weight = $1.7 + \text{expectedPoints} \times 0.1$)**:
  - The offense is between the opponent’s 20-yard line and 45-yard line. While the offense can potentially add to their score with a field goal, more plays will be needed to score a touchdown. The weight is lower, but still indicates a solid field position.

- **Decent (Weight = $1.2$)**:
  - The offense is between the opponent's 45-yard line and their own 45-yard line. This is solid field position and allows for potential long drives, though not a guaranteed scoring opportunity.

- **Neutral (Weight = $1.0$)**:
  - The offense is between their own 25-yard line and 45-yard line. This is a neutral field position, typically where a team may assume starting field position following a kickoff, where neither team has a clear advantage.

- **Bad (Weight = $0.8$)**:
  - The offense is between their own 10-yard line and 25-yard line. This is a less favorable field position, making it more difficult to reach scoring range.

- **Horrible (Weight = $0.5$)**:
  - The offense is between their own 2-yard line and 10-yard line. This is a poor field position that greatly increases the difficulty of advancing the ball.

- **Hell (Weight = $0.2$)**:
  - The offense is on their own 1-yard line. This is the worst field position, significantly limiting the chances of scoring and presenting high risk for the offense.

The weights for the four most favorable categories (Heaven, GoldZone, RedZone, and FieldGoalRange) are adjusted by the **expectedPoints** value (a metric that estimates the points likely to be scored from a given position); the remaining categories use fixed weights. While not directly tied to the pass-run prediction model, expectedPoints plays a key role in evaluating field positions, helping to assess the potential value of different starting points. These weights are essential in understanding the strategic importance of field position, providing valuable context for analyzing offensive scenarios.
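
The weight formula above translates directly into a lookup. This is a minimal sketch using the base weights and multipliers from the formula; the dictionary and function names are assumptions made for the example, and the category is taken as already assigned.

```python
# Base weight per field position category, taken from the formula above.
BASE_WEIGHT = {
    "Heaven": 2.5,
    "GoldZone": 2.2,
    "RedZone": 2.0,
    "FieldGoalRange": 1.7,
    "Decent": 1.2,
    "Neutral": 1.0,
    "Bad": 0.8,
    "Horrible": 0.5,
    "Hell": 0.2,
}

# Only the four most favorable categories scale with expectedPoints.
EXPECTED_POINTS_MULTIPLIER = {
    "Heaven": 0.3,
    "GoldZone": 0.2,
    "RedZone": 0.2,
    "FieldGoalRange": 0.1,
}


def field_position_weight(category, expected_points):
    """Return the field position weight for a category and expectedPoints value."""
    multiplier = EXPECTED_POINTS_MULTIPLIER.get(category, 0.0)
    return BASE_WEIGHT[category] + multiplier * expected_points


print(field_position_weight("Heaven", 6.0))
print(field_position_weight("Neutral", 1.5))
```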


### Game Score Scenario Weight

#### Score Differential:
The **score differential** is calculated from the perspective of the possession team. It reflects the difference between the home team's score and the visitor team's score before the snap:

$$
\text{scoreDifferential} =
\begin{cases} 
\text{preSnapHomeScore} - \text{preSnapVisitorScore} & \text{if possessionTeam is home team} \\
\text{preSnapVisitorScore} - \text{preSnapHomeScore} & \text{if possessionTeam is away team}
\end{cases}
$$

This calculation shows the point difference between the two teams as seen from the perspective of the team holding possession.

#### Game Quarter Weight:
Different quarters and game situations (such as overtime and the 2-minute warning) are assigned different weights. The **game quarter weight** reflects the importance of the game time for the current play:

$$
\text{gameQuarterWeight} =
\begin{cases} 
1.0 & \text{if quarter = 1} \\
1.2 & \text{if quarter = 2 and gameClock} > 120 \text{ seconds} \\
1.4 & \text{if quarter = 2 and gameClock} \leq 120 \text{ seconds} \\
1.5 & \text{if quarter = 3} \\
2.0 & \text{if quarter = 4 and gameClock} > 120 \text{ seconds} \\
2.4 & \text{if quarter = 4 and gameClock} \leq 120 \text{ seconds} \\
2.5 & \text{if quarter = 5 (OT)} 
\end{cases}
$$

The game quarter weight adjusts based on the timing and significance of the game quarter, including the heightened importance of overtime and the 2-minute warning.
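
Both definitions can be written as small functions. This sketch assumes the game clock has already been converted to seconds remaining in the quarter; the function names and argument names are illustrative rather than taken from the project code.

```python
def score_differential(possession_is_home, pre_snap_home_score, pre_snap_visitor_score):
    """Score differential from the possession team's point of view."""
    if possession_is_home:
        return pre_snap_home_score - pre_snap_visitor_score
    return pre_snap_visitor_score - pre_snap_home_score


def game_quarter_weight(quarter, game_clock_seconds):
    """Weight for the quarter, raised inside the final 120 seconds of each half."""
    inside_two_minutes = game_clock_seconds <= 120
    if quarter == 1:
        return 1.0
    if quarter == 2:
        return 1.4 if inside_two_minutes else 1.2
    if quarter == 3:
        return 1.5
    if quarter == 4:
        return 2.4 if inside_two_minutes else 2.0
    if quarter == 5:  # overtime
        return 2.5
    raise ValueError(f"unexpected quarter: {quarter}")


# A visiting offense trailing 17-21 with 90 seconds left in the 4th quarter.
print(score_differential(False, 21, 17))
print(game_quarter_weight(4, 90))
```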

#### Game Score Scenario Weight Calculation:
The **game score scenario weight** is determined by combining the 'base weight' and the 'penalty factor', which adjusts the importance of a play based on the score differential and the timing of the game (via the quarter weight):

$$
\text{gameScoreScenarioWeight} = \text{baseWeight} \times \text{penaltyFactor}
$$

Where the **penaltyFactor** is calculated as:

$$
\text{penaltyFactor} = \max \left( 1.0 - \frac{\left| \text{scoreDifferential} \right|}{40 \times \text{gameQuarterWeight}}, 0.1 \right)
$$

<h4>Explanation of the Penalty Factor:</h4>

- **Score Differential**: The penalty factor starts with a value of 1.0 and decreases as the absolute score differential grows. This means the larger the gap between the teams (whether the possession team is ahead or behind), the less weight a particular play will have in the model. This reflects the idea that in blowout scenarios, individual plays have a reduced impact on the overall outcome.

- **Scaling by 40**: The score differential is divided by 40, which provides a scaling factor that reflects the significance of large score gaps. For example, a 40-point lead or deficit will have a large reduction in play importance, while a smaller differential will have a less drastic impact.

- **Game Quarter Weight**: The game quarter weight further modifies the penalty factor, accounting for the timing of the game. As the game progresses into later quarters or enters critical moments such as overtime or the 2-minute warning, the `gameQuarterWeight` increases, which shrinks the score-differential penalty and keeps the penalty factor closer to 1.0. This means that in situations like a late-game comeback or overtime, plays remain more impactful despite a significant score differential.

- **Minimum Threshold of 0.1**: The penalty factor has a floor of 0.1. This ensures that even in extreme score differential scenarios (such as a large blowout), plays never lose all their relevance; some weight is always retained.


---

### Play Success Weight

#### Situational Success:
A play is considered successful based on the **down** and **yardsGained**. The conditions for success are as follows:

$$
\text{situationalSuccess} =
\begin{cases} 
\text{True} & \text{if down = 1 and yardsGained} \geq 4 \\
\text{True} & \text{if down = 2 and yardsGained} \geq \max \left( \frac{\text{yardsToGo}}{2}, 3 \right) \\
\text{True} & \text{if down = 3 or 4 and yardsGained} \geq \text{yardsToGo} \\
\text{False} & \text{otherwise}
\end{cases}
$$

This calculation evaluates whether the play gained the required number of yards for the current down and distance. On 1st down, a play is successful if it gains 4 or more yards. On 2nd down, it is successful if it gains at least half of the yards to go, with a minimum of 3 yards. On 3rd or 4th down, it must gain the full yards to go. This logic ensures that each play's success is evaluated within the context of the game’s progression and the situation at hand.
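
The success rule can be expressed as a short function. This is a minimal sketch of the conditions above; the function and argument names are illustrative.

```python
def situational_success(down, yards_gained, yards_to_go):
    """Return True when a play meets the success threshold for its down."""
    if down == 1:
        return yards_gained >= 4
    if down == 2:
        # at least half the distance, but never fewer than 3 yards
        return yards_gained >= max(yards_to_go / 2, 3)
    if down in (3, 4):
        return yards_gained >= yards_to_go
    return False


print(situational_success(1, 4, 10))  # 1st and 10, gain of 4
print(situational_success(2, 4, 10))  # 2nd and 10, gain of 4
print(situational_success(3, 3, 3))   # 3rd and 3, gain of 3
```

Note the effect of the 3-yard minimum on 2nd down: on 2nd and 4, a 2-yard gain covers half the distance but still falls short of the threshold.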

#### Play Weight Adjustments:
If the play is successful, its weight is boosted; if not, it is reduced. The **situationalBoost** is calculated as follows:

$$
\text{situationalBoost} =
\begin{cases}
1.5 & \text{if situationalSuccess is True} \\
0.5 & \text{if situationalSuccess is False}
\end{cases}
$$

#### Big Play Weight (Yardage-Based):
The **big play weight** is assigned based on how many yards were gained on the play:

$$
\text{bigPlayWeight} =
\begin{cases} 
2.0 & \text{if yardsGained} \geq 50 \\
1.5 & \text{if yardsGained} \geq 25 \\
1.2 & \text{if yardsGained} \geq 10 \\
1.0 & \text{if yardsGained} < 10
\end{cases}
$$

This weight rewards plays that gain larger amounts of yardage.

#### Apply Penalties and Bonuses:
Additionally, penalties are applied if there was a **fumble lost** or **interception**. The penalty for each event is:

$$
\text{fumblePenalty} =
\begin{cases}
0.5 & \text{if fumbleLost = 1} \\
1.0 & \text{otherwise}
\end{cases}
$$

$$
\text{interceptionPenalty} =
\begin{cases}
0.5 & \text{if passResult = 'IN'} \\
1.0 & \text{otherwise}
\end{cases}
$$

#### Final Play Success Weight:
The final **play success weight** combines all the factors into a final value:

$$
\text{finalPlaySuccessWeight} = \text{expectedPointsAdded} \times 2.0 + \text{situationalBoost} + \text{bigPlayWeight} \times \text{fumblePenalty} \times \text{interceptionPenalty}
$$

The **play success weight** combines expected points added, down-and-distance success, yardage gained, and turnover penalties (fumbles lost and interceptions) into a single measure of a play's impact. It rewards positive outcomes and penalizes negative ones, so the weight reflects the significance of the play within the context of the game.
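
To make the arithmetic concrete, here is a minimal sketch that evaluates the formula exactly as it is written above. The function names and the example inputs (including the EPA values) are hypothetical and chosen only for illustration. Under standard operator precedence, the turnover penalties scale only the big play term, so that is how this sketch reads the formula.

```python
from typing import Optional


def big_play_weight(yards_gained: float) -> float:
    """Yardage-based weight, checked from the largest threshold down."""
    if yards_gained >= 50:
        return 2.0
    if yards_gained >= 25:
        return 1.5
    if yards_gained >= 10:
        return 1.2
    return 1.0


def final_play_success_weight(
    expected_points_added: float,
    situational_success: bool,
    yards_gained: float,
    fumble_lost: int,
    pass_result: Optional[str],
) -> float:
    situational_boost = 1.5 if situational_success else 0.5
    fumble_penalty = 0.5 if fumble_lost == 1 else 1.0
    interception_penalty = 0.5 if pass_result == "IN" else 1.0
    # Evaluated term by term as written: the penalties multiply bigPlayWeight only.
    return (
        expected_points_added * 2.0
        + situational_boost
        + big_play_weight(yards_gained) * fumble_penalty * interception_penalty
    )


# Hypothetical successful 41-yard run with an EPA of 3.0: 6.0 + 1.5 + 1.5
print(final_play_success_weight(3.0, True, 41, 0, None))    # 9.0
# Hypothetical interception with an EPA of -7.0: -14.0 + 0.5 + 0.5
print(final_play_success_weight(-7.0, False, 0, 0, "IN"))   # -13.0
```

In both examples the EPA term dominates, which is why a pick-six with a large negative EPA produces a strongly negative weight.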

---

### Formal Play Score

The **formal play score (FPS)** is the product of the field position weight, game score scenario weight, and play success weight:

$$
\text{FPS} = \text{field\_position\_weight} \times \text{gameScoreScenarioWeight} \times \text{finalPlaySuccessWeight}
$$

The FPS is a comprehensive metric designed to evaluate the overall impact of a play by integrating critical factors such as field position, the current game score scenario, and the success of the play itself. This score was developed to assist in predicting the success and impact of plays, particularly with respect to pre-snap motion tendencies. Although the results were inconclusive in terms of gaining valuable insights into motion tendencies, the FPS did align with logical expectations: high FPS values were often tied to touchdowns or significant gains in high-stakes scenarios, such as end-of-game situations with close score margins. Conversely, low FPS values were linked to turnovers or plays with minimal yardage gains in critical moments.

#### High FPS Example:
For example, in the **2022100205** game (Seattle Seahawks vs. Detroit Lions, Week 4), a **high FPS** of **44.48** was assigned to a touchdown play by Rashaad Penny. On 3rd down with 5 yards to go, with only 2:22 left in the 4th quarter of a 3-point game, Penny broke off a **41-yard touchdown run**. This play occurred with a field position weight reflecting the **FieldGoalRange** and a **gameScoreScenarioWeight** that accounted for the critical time and score situation. The play was successful, and the score differential and timing made the large gain extremely important.

#### Low FPS Example:
On the other hand, in the **2022091500** game (Los Angeles Chargers vs. Kansas City Chiefs, Week 2), a **low FPS** of **-100.44** was assigned to a play where Justin Herbert’s pass was intercepted by Jaylen Watson and returned 99 yards for a touchdown. Although the field position was in the **GoldZone**, the **play success weight** was heavily penalized by the interception, which drastically reduced the FPS. The **score differential** (a tie game) and **game quarter weight** contributed to a minimal field position weight, but the turnover created a major negative impact on the final FPS.

While this FPS dataset was primarily used for aggregating play-level information into the training features dataset, the **field_position_weight** and **gameQuarterWeight** factors proved valuable during feature selection and contributed significantly to the current model, which predicts **play_type** (pass or run) with about 90% accuracy. Future refinements and applications of this approach could yield further insights and improvements in predictive modeling for football strategy.

#### FPS Insight Example: Average Yards Gained vs. Motion Presence per NFL Team

This figure plots the average yards gained by each NFL team when it uses pre-snap movement (either a shift or a motion) compared to when there is no motion. In Figure 5, teams above the regression line tend to gain more yards on average with motion. Buffalo gains the most yards with motion, an increase of **+0.7** yards when motion is involved. The Cardinals, by contrast, average **0.9** fewer yards with motion than without: **4.4** yards with motion, the lowest of any team, versus **5.3** with no motion, which is close to average. The visualization provides an interesting comparison across teams, highlighting how motion can affect average yardage. It also serves as a small example of the insights that can be derived from the FPS dataset, which could support deeper exploration of motion patterns and their effects on play outcomes.
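
The per-team comparison behind the figure is a simple grouped average. Below is a minimal sketch in plain Python, assuming each play record carries a team label, a motion flag, and the yards gained. The function name `motion_yardage_delta`, the team labels, and the yardage values are made-up placeholders, not values from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical play records: (team, had_motion, yards_gained).
plays = [
    ("AAA", True, 6), ("AAA", True, 4), ("AAA", False, 3), ("AAA", False, 5),
    ("BBB", True, 2), ("BBB", False, 7), ("BBB", False, 5),
]


def motion_yardage_delta(plays):
    """Map team -> (avg yards with motion, avg yards without, difference)."""
    yards = defaultdict(lambda: {True: [], False: []})
    for team, had_motion, gained in plays:
        yards[team][had_motion].append(gained)

    result = {}
    for team, groups in yards.items():
        # Skip teams that lack plays in either group.
        if groups[True] and groups[False]:
            with_motion = mean(groups[True])
            without_motion = mean(groups[False])
            result[team] = (with_motion, without_motion, with_motion - without_motion)
    return result


for team, (with_m, without_m, delta) in motion_yardage_delta(plays).items():
    print(f"{team}: with motion {with_m}, without {without_m}, difference {delta:+}")
```

A positive difference corresponds to a team that gains more with motion, and a negative difference to a team that gains less.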

The visual is built as an interactive Plotly figure: hovering over a point shows more detailed information about that team's performance. The link to the interactive plot is currently not displaying correctly, but I will get it up as soon as I can.

[View Plotly Figure](https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/FPS%20Visuals/team_comparison_avgYrds_vs_Motion.html)

![Avg Yards vs Motion Presence](https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/FPS%20Visuals/team_comparison_avgYards_vs_Motion.png)
<center>Figure 5. Team Avg. Yards Comparison for Motion vs No Motion.</center>
<br>


## B. Offensive Tracking Motion Features
### Step 2: Focusing on Tracking Data and Motion Detection

After receiving valuable feedback from the host, I pivoted towards leveraging the tracking data for its predictive potential. The goal was to analyze player motion data before the snap, enabling a deeper understanding of offensive and defensive shifts. By focusing on player speed and its local maxima and minima, I was able to detect motion frames and categorize player movements in a way that the original data did not fully capture.

I developed a robust process for identifying and categorizing motion frames by slicing the tracking data to only include pre-snap frames. Using the `process_motion_frames()` function, I applied several motion thresholds to determine when players were in motion:

- **Resting state**: Players were considered to be in a resting state if their speed was less than 0.5.
- **Motion state**: Players were considered in motion if their speed exceeded 1.5.

The key innovation in my approach was detecting motion more accurately than the original data. For instance, a player like DK Metcalf, who was flagged as having a “shiftSinceLineset” in the original data, was actually classified as not having a true shift or motion according to my more precise function. The function identified that he didn’t meet the criteria for motion, and upon reviewing the play animation, it became clear that he didn't actually shift or move significantly before the snap.

Here’s how my `process_motion_frames()` function works:

1. **Local Minima and Maxima Detection**:  
   I first identified local minima (the lowest points in the speed data) and maxima (the highest points in the speed data) by analyzing the speed of each player frame-by-frame. This process focused on frames that occurred after 30% of the total frames to ensure the detection was meaningful (i.e., not influenced by initial setup movements). Local minima and maxima help in identifying when players start and stop their motion.

2. **Motion Segmentation**:  
   Once the minima and maxima were detected, I created motion segments by capturing all frames between each minimum and its subsequent maximum. If the speed dropped below the resting threshold (0.5) after the maximum, the motion was considered to end there; if the speed remained high, the motion continued until the last frame.

3. **Motion Direction Detection**:  
   The function also identifies the direction of the motion, adding another layer of information that can be used to understand player movements more comprehensively.
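
The first two steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual `process_motion_frames()` implementation: the name `find_motion_segments` and the constants are hypothetical, local extrema are found by comparing neighboring frames, and direction detection is omitted.

```python
REST_SPEED = 0.5      # below this the player is treated as at rest
MOTION_SPEED = 1.5    # a peak must exceed this to count as motion
SKIP_FRACTION = 0.3   # ignore the first 30% of pre-snap frames


def find_motion_segments(speeds):
    """Return (start, end) frame-index pairs for one player's pre-snap speeds."""
    n = len(speeds)
    i = max(1, int(n * SKIP_FRACTION))
    segments = []
    while i < n - 1:
        # Local minimum: not above the previous frame and below the next one.
        if not (speeds[i] <= speeds[i - 1] and speeds[i] < speeds[i + 1]):
            i += 1
            continue
        # Climb to the subsequent local maximum.
        peak = i + 1
        while peak < n - 1 and speeds[peak + 1] >= speeds[peak]:
            peak += 1
        if speeds[peak] <= MOTION_SPEED:
            i = peak + 1  # too slow to count as motion; keep scanning
            continue
        # Motion ends at the first frame back below the resting threshold,
        # otherwise it runs through the last pre-snap frame.
        end = n - 1
        for j in range(peak + 1, n):
            if speeds[j] < REST_SPEED:
                end = j
                break
        segments.append((i, end))
        i = end + 1
    return segments


# Made-up speed traces, one value per frame.
moves_then_sets = [0.1, 0.1, 0.1, 0.1, 0.2, 0.1, 0.1, 0.3,
                   1.0, 2.0, 2.5, 1.8, 0.9, 0.4, 0.2, 0.1]
moving_at_snap = [0.1] * 6 + [0.3, 0.9, 1.6, 2.2, 2.4, 2.6]

print(find_motion_segments(moves_then_sets))  # [(6, 13)]
print(find_motion_segments(moving_at_snap))   # [(5, 11)]
```

A segment that ends on the last pre-snap frame, as in the second trace, is the kind of case that would correspond to a player still moving when the ball is snapped.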


With this enhanced motion detection, I was able to determine more complex motion behaviors, such as when a player shifts and then motions as the ball is snapped. This level of granularity was crucial for improving the model’s understanding of pre-snap movements and the dynamics of offensive and defensive positioning. For example, in the case of Quez Watkins, you can see how this method captured his nuanced pre-snap motion, offering a more accurate reflection of his actions compared to the original data.

This process allowed me to:
- Identify **motion frames** accurately.
- Count **how many players were in motion**.
- Categorize different types of motion, such as **no motion**, **single shift**, **multiple player shift**, **single player motion**, and **combined shifts and motions**.

As a result, I was able to generate more detailed pre-snap features, such as:
- **inMotionAtBallSnap**: Whether the player was in motion at the moment the ball was snapped.
- **shiftSinceLineset**: Whether a player shifted after the line was set.
- **motionSinceLineset**: Whether the player exhibited motion after the line was set.

The example below shows the pre-snap motion analysis for a play involving a 4-man shift. In the given data, all 5 players have `motionSinceLineset` = True. But as you can see, DK Metcalf does not meet the requirements of my motion capturing function, which excludes him from the motion player dataframe. This was confirmed via my play animation code.

By extracting these nuanced features from the tracking data, I was able to refine my predictive model’s ability to identify shifts and motions, ultimately improving its performance.

<div style="display: flex; justify-content: space-evenly;">
    <img src="https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/SEA%20Speed%20Plot%20Presnap/speed_vs_frameId_Colby%20Parkinson.png" width="200"/>
    <img src="https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/SEA%20Speed%20Plot%20Presnap/speed_vs_frameId_DK%20Metcalf.png" width="200"/>
    <img src="https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/SEA%20Speed%20Plot%20Presnap/speed_vs_frameId_Rashaad%20Penny.png" width="200"/>
    <img src="https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/SEA%20Speed%20Plot%20Presnap/speed_vs_frameId_Tyler%20Lockett.png" width="200"/>
    <img src="https://raw.githubusercontent.com/TechMax14/BIG-DATA-BOWL-25/main/visuals/SEA%20Speed%20Plot%20Presnap/speed_vs_frameId_Will%20Dissly.png" width="200"/>
</div>

<center>Figure 6. Pre-snap Motion Detection for Key Seattle Seahawks Offensive Players (gameId: 2022092511, playId: 735).</center>
<br>