<a href="https://colab.research.google.com/github/Keoni808/NFL_Data_Cleaning/blob/main/NFL_Plays_Cleaning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

PURPOSE:
- To break down every play scraped by NFL_Scraper into a usable form. In the data's raw form, each play comes with a free-text description: a sentence or two that packs a lot of usable information into a single string.
  - The features I would like to extract from these descriptions include:
    1. PlayType (e.g., pass, run, etc.)
    2. TimeOnTheClock (when the play occurred)
    3. Formation (offensive formation during the play)
    4. Player Involvement (passer, rusher, receiver, etc.)
    5. Outcome Details (yardage gained, direction, tackles, etc.)
    6. Penalties and Injuries (any penalties, injured players, etc.)
    7. ...and more.

- Technology stack
  - Google BigQuery
    - Currently holding all raw data that will be cleaned

**STATUS AND FUTURE ISSUES**
- Currently this iteration is working on a single game (Super Bowl 2023)
  - Once all plays have been broken down for this game, will move on to an entire season.
    - **IMPORTANT**
      - This one game does not cover every type of play. There will be other plays that this current program will not correctly break down, if it does at all.
      - I need to keep this in mind and figure out some way to raise an error when a play has not been correctly handled.
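
One possible way to raise an error for plays that were not handled is sketched below. It assumes that a play counts as "handled" when a chosen set of required columns is filled in; `UnparsedPlayError`, `find_unparsed_plays`, and `assert_all_parsed` are hypothetical names, and the small `demo` frame is made up for illustration.

```python
import pandas as pd


class UnparsedPlayError(ValueError):
    """Hypothetical exception for plays whose required features are missing."""


def find_unparsed_plays(df_detailed, required_columns):
    """Return the index labels of rows where any required column is empty."""
    subset = df_detailed[required_columns]
    # Treat NaN/None, empty strings, and stringified missing values as missing.
    missing = subset.isna() | subset.astype(str).isin(["", "nan", "None", "<NA>"])
    return df_detailed.index[missing.any(axis=1)].tolist()


def assert_all_parsed(df_detailed, required_columns):
    """Raise UnparsedPlayError if any row is missing a required feature."""
    bad = find_unparsed_plays(df_detailed, required_columns)
    if bad:
        raise UnparsedPlayError(f"{len(bad)} play(s) not fully parsed: {bad}")


# Made-up example: the second play has no usable TimeOnTheClock.
demo = pd.DataFrame(
    {"PlayType": ["Pass", "Pass"], "TimeOnTheClock": ["11:39", "nan"]},
    index=[1, 2],
)
print(find_unparsed_plays(demo, ["PlayType", "TimeOnTheClock"]))
```

Which columns are truly required will differ per play type (an incomplete pass has no tackler, for example), so the required list would need to be chosen per category.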

# MOUNTING AND IMPORTS

In [None]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Used to access personal google cloud services
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
# imports

# Data manipulation
import pandas as pd

# Regular expressions
import re

# Grab data from database
from google.cloud import bigquery

# LOADING DATA (BigQuery queries)


In [None]:
# Client connect to bigquery project
client = bigquery.Client('nfl-data-430702')

## ALL PLAYS 2023
 - For the future. Not being worked on yet.

In [None]:
# nfl_2023_plays_query = """
#                        SELECT *
#                        FROM `nfl-data-430702.NFL_Scores.NFL-Plays-2023`
#                        """

# # Run the query, and return a pandas DataFrame
# dry_run_config = bigquery.QueryJobConfig(dry_run=True)
# dry_run_query = client.query(nfl_2023_plays_query, job_config=dry_run_config)
# print("This query will process {} bytes.".format(dry_run_query.total_bytes_processed))

# safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
# safe_config_query = client.query(nfl_2023_plays_query, job_config=safe_config)

# # df_nfl_scores_data = safe_config_query.to_dataframe()

In [None]:
# df_2023_plays = safe_config_query.to_dataframe()

In [None]:
# df_2023_plays.head()

## SUPER BOWL PLAYS 2023

In [None]:
# Grabbing all plays from Super Bowl 2023
nfl_2023_sb_plays_query = """
                          SELECT *
                          FROM `nfl-data-430702.NFL_Scores.NFL-Plays-SuperBowl-2023`
                          """

# Running a dry-run query, which returns the number of bytes the real query will process
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query = client.query(nfl_2023_sb_plays_query, job_config=dry_run_config)
print("This query will process {} bytes.".format(dry_run_query.total_bytes_processed))

# Running query (Being mindful of the amount of data being grabbed)
# maximum_bytes_billed caps the job at 1 GB (10**9 bytes); a query that would bill more fails instead of running
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
safe_config_query = client.query(nfl_2023_sb_plays_query, job_config=safe_config)

This query will process 41291 bytes.


In [None]:
# Putting data attained from query into a dataframe
df_2023_plays_sb = safe_config_query.to_dataframe()

In [None]:
# View of the raw data attained from NFL_Scraper
df_2023_plays_sb.head()

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,PlayNumberInDrive,IsScoringPlay,PlayOutcome,PlayDescription,PlayStart
0,2023,Super Bowl,SUN,02/11,49ers,Chiefs,1ST QUARTER,2,KC,0,1,0,-3 Yard Run,(12:15) (Shotgun) I.Pacheco left guard to KC 2...,1st & 10 at KC 27
1,2023,Super Bowl,SUN,02/11,49ers,Chiefs,1ST QUARTER,2,KC,0,2,0,1 Yard Pass,(11:39) (Shotgun) P.Mahomes pass short left to...,2nd & 13 at KC 24
2,2023,Super Bowl,SUN,02/11,49ers,Chiefs,1ST QUARTER,2,KC,0,3,0,8 Yard Pass,(11:04) (Shotgun) P.Mahomes pass short right t...,3rd & 12 at KC 25
3,2023,Super Bowl,SUN,02/11,49ers,Chiefs,1ST QUARTER,2,KC,0,4,0,Punt,"(10:24) T.Townsend punts 43 yards to SF 24, Ce...",4th & 4 at KC 33
4,2023,Super Bowl,SUN,02/11,49ers,Chiefs,1ST QUARTER,4,KC,0,1,0,10 Yard Run,(6:28) (Shotgun) I.Pacheco right guard to KC 2...,1st & 10 at KC 11


In [None]:
# Observation of the amount of data being worked on
df_2023_plays_sb.shape

(191, 15)

# CATEGORIZE PLAYS
- The goal here is to parse out the different values for 'PlayOutcome'
  - separate pass / run / kickoff / etc.

## PARSING


In [None]:
# All play outcomes from the game
# - From here we can categorize and clean plays accordingly
df_2023_plays_sb['PlayOutcome'].unique()

array(['-3 Yard Run', '1 Yard Pass', '8 Yard Pass', 'Punt', '10 Yard Run',
       '-4 Yard Sack', 'Pass for No Gain', '4 Yard Run', 'Kickoff',
       '7 Yard Pass', '2 Yard Run', '5 Yard Run', '52 Yard Pass',
       'Fumble', 'Pass Incomplete', '3 Yard Run', 'Run for No Gain',
       '10 Yard Pass', 'Interception', '9 Yard Pass', '5 Yard Pass',
       '6 Yard Run', '18 Yard Pass', '11 Yard Pass', '11 Yard Run',
       '-2 Yard Run', '-5 Yard Penalty', '-10 Yard Penalty',
       '12 Yard Pass', '24 Yard Run', '1 Yard Run', 'Sack',
       '-8 Yard Pass', '-1 Yard Run', '8 Yard Run', '6 Yard Pass',
       '21 Yard Pass', '3 Yard Pass', '-1 Yard Sack', 'Field Goal',
       '22 Yard Run', '2 Yard Pass', 'Touchdown Chiefs',
       'Extra Point Good', '16 Yard Pass', '13 Yard Pass', '25 Yard Pass',
       '9 Yard Run', '-3 Yard Sack', '22 Yard Pass', '-3 Yard Pass',
       '4 Yard Pass', '19 Yard Run', '5 Yard Penalty', '19 Yard Pass',
       '-4 Yard Run', '7 Yard Run', '16 Yard Run', 'Touch

In [None]:
# Looking at all unique play outcomes and categorizing them.
# - This type of approach does not feel very flexible because a play outcome can
#   arise that has not been seen yet.
# - There may be more outcomes when working on a full season, let alone all seasons and future games
df_2023_pass_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Pass')]
df_2023_run_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Run')]

df_2023_punt_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Punt')]
df_2023_sack_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Sack')]
df_2023_kickoff_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Kickoff')]
df_2023_fumble_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Fumble')]
df_2023_interception_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Interception')]
df_2023_penalty_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Penalty')]
df_2023_fieldgoal_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Field Goal')]
df_2023_touchdown_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Touchdown')]
df_2023_extrapoint_sb = df_2023_plays_sb[df_2023_plays_sb['PlayOutcome'].str.contains('Extra Point')]

plays_list = [df_2023_pass_sb,
              df_2023_run_sb,
              df_2023_punt_sb,
              df_2023_sack_sb,
              df_2023_kickoff_sb,
              df_2023_fumble_sb,
              df_2023_interception_sb,
              df_2023_penalty_sb,
              df_2023_fieldgoal_sb,
              df_2023_touchdown_sb,
              df_2023_extrapoint_sb]
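
A more flexible variant could assign every outcome a category in one pass and fall back to an explicit "Unknown" label for anything unseen. The sketch below is only an illustration: `OUTCOME_KEYWORDS` and `categorize_outcome` are hypothetical names, the keywords are the ones used in the filters above, and "Safety" is a made-up outcome used to show the fallback.

```python
import pandas as pd

# Ordered (keyword, category) pairs; the first keyword found in the outcome wins.
OUTCOME_KEYWORDS = [
    ("Pass", "Pass"),
    ("Run", "Run"),
    ("Punt", "Punt"),
    ("Sack", "Sack"),
    ("Kickoff", "Kickoff"),
    ("Fumble", "Fumble"),
    ("Interception", "Interception"),
    ("Penalty", "Penalty"),
    ("Field Goal", "Field Goal"),
    ("Touchdown", "Touchdown"),
    ("Extra Point", "Extra Point"),
]


def categorize_outcome(outcome):
    """Map a 'PlayOutcome' string to a category, or 'Unknown' if nothing matches."""
    for keyword, category in OUTCOME_KEYWORDS:
        if keyword in outcome:
            return category
    return "Unknown"


outcomes = pd.Series(["-3 Yard Run", "Pass Incomplete", "Touchdown Chiefs", "Safety"])
categories = outcomes.map(categorize_outcome)
print(categories.tolist())
```

Stored as a new column, the category could then drive a `groupby`, and any row labelled "Unknown" would point directly at an outcome that still needs a rule.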

## SANITY CHECK (All Plays Accounted for)

In [None]:
# A check to make sure that all plays have been categorized.
# - The check puts all categorized plays into a single dataframe
#   and will compare with the original dataframe to make sure
#   that they are the same.
# pd.concat accepts the whole list, so no loop is needed
df_check = pd.concat(plays_list)

In [None]:
df_check = df_check.sort_index()

In [None]:
df_2023_plays_sb.equals(df_check)

True
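
`DataFrame.equals` only answers yes or no. When the check fails, it helps to know which plays were missed or matched by more than one category. A minimal sketch follows, assuming the categories are meant to partition the plays by index label; `check_partition` is a hypothetical helper and the four-row frame is made up.

```python
import pandas as pd


def check_partition(df_all, parts):
    """Return (missing, duplicated) index labels for a list of category frames."""
    combined = pd.concat(parts)
    # Labels present in the original frame but in none of the categories.
    missing = df_all.index.difference(combined.index).tolist()
    # Labels that landed in more than one category.
    duplicated = combined.index[combined.index.duplicated()].unique().tolist()
    return missing, duplicated


# Made-up example: "Safety" is not covered by any of the three filters.
df_all = pd.DataFrame({"PlayOutcome": ["1 Yard Pass", "Punt", "4 Yard Run", "Safety"]})
parts = [df_all[df_all["PlayOutcome"].str.contains(k)] for k in ("Pass", "Run", "Punt")]
missing, duplicated = check_partition(df_all, parts)
print("missing:", missing, "duplicated:", duplicated)
```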

## HELPER METHODS

In [None]:
# PURPOSE:
# - Quick look at a section of plays
#   - Ideally the plays that the user wants to break down and clean.
# INPUT PARAMETERS:
# df_all_plays      - DataFrame - The original dataframe where the desired plays to view came from
# df_section_plays  - DataFrame - A section of the original dataframe the user wants to view
# RETURN:
# - Printing to the console:
#   1. index of play
#   2. 'PlayDescription' feature of play
#   3. 'PlayOutcome' feature of play
def print_plays(df_all_plays, df_section_plays):
  for idx, value in df_section_plays['PlayOutcome'].items():
    print("index:" + str(idx))
    play = df_all_plays.loc[idx, 'PlayDescription']  # idx is an index label, so use .loc rather than .iloc
    print(play)
    print(value)
    print()

# FEATURE BREAKDOWN 'PlayDescription'

ISSUES:
- laterals?
- penalties
  - Multiple penalties within a single play

- I need a check to make sure that all plays have been broken down
  - Possibly add a check for each type of play that happened?

- Touchdowns
  - A passing touchdown is not included within the passing category, it is in its own.

- Fumbles
  - How do I break this down?
  - What happens if there's a fumble after a fumble and it keeps going?

- Error correction catching system.
  - I need to raise errors when something does not break down correctly.



## NEW ADDED FEATURES

In [None]:
###########################
# NEW COLUMN DESCRIPTIONS #
###########################

# PlayType           - The type of play (e.g. pass/run)
# TimeOnTheClock     - The time that was on the clock when the play started
# Formation          - Play formation
# Passer             - Player that threw the ball (mostly the quarterback)
# Rusher             - Player that ran the ball (mostly the running back)
# Receiver           - Player on the same team as the passer that caught the ball
# PassType           - Whether the pass was a deep or short pass?
# Direction          - Where the ball is going during the play
# Yardage            - Total yardage gained on the intended play (yardage from penalties and fumble recoveries does not count)
# TackleBy1          - Main tackler on the play (could be solo or could be with someone else)
# TackleBy2          - Assisting tackler
# PressureBy         - Defender that applied pressure to the passer
# ForcedFumbleBy     - Defender that forced a fumble
# AfterFumble        - A list that has what happened after the fumble
#                      - [recovered by, yards gained, tackled by]
# InjuredPlayers     - Players that were injured during the play
# PenaltyDescription - If there is a penalty, gives a description of it
#                      - [who caused the penalty, what was the penalty, yards lost if penalty accepted]

new_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "PassType", "Direction", "Yardage",
               "TackleBy1", "TackleBy2", "PressureBy", "ForcedFumbleBy",
               "AfterFumble",
               "InjuredPlayers", "PenaltyDescription"]

string_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "PassType", "Direction",
                  "TackleBy1", "TackleBy2", "PressureBy", "ForcedFumbleBy",
                  "AfterFumble",
                  "InjuredPlayers", "PenaltyDescription"]

int_columns = ["Yardage"]

In [None]:
####################################################
# REGULAR EXPRESSIONS USED TO LOCATE SPECIFIC DATA #
####################################################

################
# PLAY DETAILS #
################

time_on_clock_pattern = r'\(\d*:\d+\)'
formation = r'\([A-Za-z]+ ?[A-Za-z]*,? ?[A-Za-z]*\)'
manual_yardage = r'-?\d+ yards?' # Used when 'PlayOutcome' does not have yardage gained from intended play ("-?" keeps the sign of a loss)

#################
# NAMES OFFENSE #
#################

name_pattern = r'\b[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\b' # Grabs all names but will only be used for Passer
receiver_name_pattern = r'\b [A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\b' # Receivers have a space before their name
rusher_pattern = r'\b[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]* \b' # Runningbacks, like quarterbacks, are the first names in play descriptions

#################
# NAMES DEFENSE #
#################

defense_tackler_1_name_pattern = r'\([A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*' # Will have a "(" in front of the name
defense_tackler_2_name_pattern = r' [A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\)' # Will have a ")" at the end of the name
defense_pressure_name_pattern = r'\[[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\]' # Surrounded by "[]" brackets

########################
# TEAM IDENTIFIED NAME #
########################

team_identified_name = r'-[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*' # team initials comes before their name (e.g. KC-B.Bob).
                                                           # - This occurs when there is an injury, penalty, fumble recovery.
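
To make it concrete what these patterns pull out, here is a small self-contained sketch that applies them to the description of play index 1. The patterns are copied from the cell above; the `first` and `extract_pass_features` helpers are hypothetical and only wrap `re.findall` with a guard for missing matches.

```python
import re

# Patterns copied from the cell above.
time_on_clock_pattern = r'\(\d*:\d+\)'
formation = r'\([A-Za-z]+ ?[A-Za-z]*,? ?[A-Za-z]*\)'
name_pattern = r'\b[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\b'
receiver_name_pattern = r'\b [A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\b'
defense_tackler_1_name_pattern = r'\([A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*'
defense_tackler_2_name_pattern = r' [A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\)'
defense_pressure_name_pattern = r'\[[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\]'


def first(pattern, text, start=0, end=None):
    """Return the first match sliced by [start:end], or None when nothing matches."""
    matches = re.findall(pattern, text)
    return matches[0][start:end] if matches else None


def extract_pass_features(play):
    """Hypothetical helper: collect the regex-based features of one pass play."""
    return {
        "TimeOnTheClock": first(time_on_clock_pattern, play, 1, -1),  # drop "(" and ")"
        "Formation": first(formation, play, 1, -1),
        "Passer": first(name_pattern, play),                  # first name in the text
        "Receiver": first(receiver_name_pattern, play, 1),    # drop the leading space
        "TackleBy1": first(defense_tackler_1_name_pattern, play, 1),
        "TackleBy2": first(defense_tackler_2_name_pattern, play, 1, -1),
        "PressureBy": first(defense_pressure_name_pattern, play, 1, -1),
    }


play = ("(11:39) (Shotgun) P.Mahomes pass short left to T.Kelce "
        "to KC 25 for 1 yard (C.Young; D.Greenlaw).")
features = extract_pass_features(play)
for name, value in features.items():
    print(f"{name}: {value}")
```

Returning `None` for a missing match, rather than indexing `[0]` directly, avoids an `IndexError` on descriptions that lack a given element.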

## PASS

In [None]:
# Creating a copy df so the original is not messed with
df_2023_pass_sb_detailed = df_2023_pass_sb.copy()
df_2023_pass_sb_detailed = df_2023_pass_sb_detailed.reindex(columns=df_2023_pass_sb_detailed.columns.tolist() + new_columns)
df_2023_pass_sb_detailed[string_columns] = df_2023_pass_sb_detailed[string_columns].astype(str)
df_2023_pass_sb_detailed[int_columns] = df_2023_pass_sb_detailed[int_columns].astype(float)
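
One thing to watch with `astype(str)`: the columns added by `reindex` start out as NaN, and converting NaN with `astype(str)` turns it into the literal text `'nan'`, which no longer registers as missing. A possible alternative, shown here only as a sketch on a made-up two-row frame, is to use the pandas nullable dtypes `"string"` and `"Int64"`.

```python
import pandas as pd

# Made-up frame standing in for the pass plays.
df = pd.DataFrame({"PlayOutcome": ["1 Yard Pass", "Pass Incomplete"]})
df = df.reindex(columns=df.columns.tolist() + ["Passer", "Yardage"])

as_plain_str = df["Passer"].astype(str)           # NaN becomes the text 'nan'
as_nullable_str = df["Passer"].astype("string")   # NaN stays missing (<NA>)
as_nullable_int = df["Yardage"].astype("Int64")   # integer column that allows <NA>

print(as_plain_str.tolist())
print(as_nullable_str.isna().tolist())
print(as_nullable_int.dtype)
```

Columns that are meant to hold lists (such as `AfterFumble` or `PenaltyDescription`) would probably be better left as `object` dtype, since a string dtype is not intended for list values.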

### IDENTIFYING DIFFERENT PASS PLAYS
- This section is used to categorize different pass plays to see if they have to be handled differently.
  - Eventually, each category of pass play will break down into the same set of features. The question here is how does each category break down to fall into these common features?

GOAL: To create a single method that will handle each type of pass play and break them down to a common set of features.

In [None]:
df_2023_pass_sb['PlayOutcome'].unique()

array(['1 Yard Pass', '8 Yard Pass', 'Pass for No Gain', '7 Yard Pass',
       '52 Yard Pass', 'Pass Incomplete', '10 Yard Pass', '9 Yard Pass',
       '5 Yard Pass', '18 Yard Pass', '11 Yard Pass', '12 Yard Pass',
       '-8 Yard Pass', '6 Yard Pass', '21 Yard Pass', '3 Yard Pass',
       '2 Yard Pass', '16 Yard Pass', '13 Yard Pass', '25 Yard Pass',
       '22 Yard Pass', '-3 Yard Pass', '4 Yard Pass', '19 Yard Pass',
       '17 Yard Pass', '20 Yard Pass', '23 Yard Pass', '24 Yard Pass'],
      dtype=object)

In [None]:
# 3 different formats as far as I can see.
# 1. '# Yard Pass'
# 2. 'Pass Incomplete'
# 3. 'Pass for No Gain'

# NOTE:
# - I have worked through all of these different pass plays.
#   - I am now using 'df_2023_pass_sb' which is the dataframe holding
#     all passing plays.
# - This is here just as documentation that these are the different sets of
#   passing plays that were worked on.

df_successful_passes = df_2023_pass_sb[df_2023_pass_sb['PlayOutcome'].str.contains('Yard Pass')]
df_incomplete_passes = df_2023_pass_sb[df_2023_pass_sb['PlayOutcome'].str.contains('Pass Incomplete')]
df_pass_for_no_gain = df_2023_pass_sb[df_2023_pass_sb['PlayOutcome'].str.contains('Pass for No Gain')]

In [None]:
# Sanity check
# - Make sure that all pass plays have been accounted for

df_all_pass_plays = [df_successful_passes, df_incomplete_passes, df_pass_for_no_gain]
# pd.concat accepts the whole list, so no loop is needed
df_check = pd.concat(df_all_pass_plays)

df_check = df_check.sort_index()
df_2023_pass_sb.equals(df_check)

True

### PASS BREAKDOWN

In [None]:
print_plays(df_2023_plays_sb, df_2023_pass_sb)

index:1
(11:39) (Shotgun) P.Mahomes pass short left to T.Kelce to KC 25 for 1 yard (C.Young; D.Greenlaw).
1 Yard Pass

index:2
(11:04) (Shotgun) P.Mahomes pass short right to J.McKinnon to KC 33 for 8 yards (F.Warner, D.Greenlaw).
8 Yard Pass

index:6
(5:15) (Shotgun) P.Mahomes pass short left to R.Rice to KC 17 for no gain (F.Warner).
Pass for No Gain

index:10
(14:48) P.Mahomes pass short left to I.Pacheco pushed ob at KC 32 for 7 yards (T.Gipson).
7 Yard Pass

index:13
(13:01) (Shotgun) P.Mahomes pass deep right to M.Hardman to SF 9 for 52 yards (J.Brown).
52 Yard Pass

index:15
(9:16) (Shotgun) P.Mahomes pass incomplete short left [C.Young]. PENALTY on KC-P.Mahomes, Intentional Grounding, 10 yards, enforced at KC 20.
Pass Incomplete

index:21
(14:15) (Shotgun) P.Mahomes pass short middle to N.Gray to KC 23 for 10 yards (L.Ryan).
10 Yard Pass

index:23
(12:31) (Shotgun) P.Mahomes pass incomplete deep left to M.Valdes-Scantling.
Pass Incomplete

index:24
(12:26) (Shotgun) P.Mahomes p

## **Fumble Notes (Different types of documented 'fumble' plays)**

**PASS**

1. index:98

  (9:22) (Shotgun) P.Mahomes to KC 49 for -5 yards. FUMBLES, and recovers at KC 48. P.Mahomes pass incomplete deep middle [N.Bosa].

  Pass Incomplete

  **(Fumble classification - standard)**

**RUN**

2. index:20
  
  (15:00) P.Mahomes FUMBLES (Aborted) at KC 17, recovered by KC-I.Pacheco at KC 15. I.Pacheco to KC 13 for -2 yards (N.Bosa).
  
  Run for No Gain

  **(Fumble classification - Aborted)**

3. index:12

  (13:41) R.Rice right tackle to KC 37 for 3 yards (L.Ryan; D.Greenlaw). FUMBLES (L.Ryan), recovered by KC-Ju.Watson at KC 37. Ju.Watson to KC 39 for 2 yards (J.Brown).

  5 Yard Run

  **(Fumble classification - Forced Fumble)**

PLAN:
- I need a larger sample of fumble plays. (Work on Week 1 2023)
  - I think each 'fumble' play is broken into sentences in the order things happened in real time, and each of these sentences follows a pattern (I think).
    - Pattern 1 (Play): There is always a semi-generic play somewhere, meaning a sentence that follows the pattern of a run or a pass play.
    - Pattern 2 (Fumble): One of the sentences reports the fumble and the recovery, and SOMETIMES the cause, depending on the type of fumble.
      TYPES OF FUMBLES:
      1. Standard - Ball carrier drops the ball
      2. Aborted - Transfer between center and quarterback (or passer) was disrupted
      3. Forced Fumble - Defender knocks the ball loose from the ball carrier
  - The last sentence (if there is one) describes what happened after the recovery, in the form of another play.
     - **This only happens in Standard fumbles and Forced fumbles.**


1. split descriptions into individual sentences.
2. Find sentence that says fumble and who recovered.
  - 1. FUMBLES, and recovers at KC 48.
  - 2. (15:00) P.Mahomes FUMBLES (Aborted) at KC 17, recovered by KC-I.Pacheco at KC 15.
  - 3. FUMBLES (L.Ryan), recovered by KC-Ju.Watson at KC 37.
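
The two steps above can be sketched as follows, using the three fumble descriptions documented in this section. `split_sentences`, `classify_fumble`, and `fumble_types` are hypothetical helpers, and the rule for telling the three types apart is an assumption based only on these three examples.

```python
import re

# A defender's name in parentheses right after FUMBLES is taken to mean a forced fumble.
FORCED_PATTERN = re.compile(r"FUMBLES \([A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\)")


def split_sentences(description):
    """Split on '. ' so that the '.' inside names such as 'P.Mahomes' is kept."""
    return [s.strip() for s in description.split(". ") if s.strip()]


def classify_fumble(sentence):
    """Return 'Aborted', 'Forced', 'Standard', or None if there is no fumble."""
    if "FUMBLES" not in sentence:
        return None
    if "(Aborted)" in sentence:
        return "Aborted"
    if FORCED_PATTERN.search(sentence):
        return "Forced"
    return "Standard"


def fumble_types(description):
    """Classify every fumble sentence in a play description, in order."""
    types = [classify_fumble(s) for s in split_sentences(description)]
    return [t for t in types if t is not None]


standard = ("(9:22) (Shotgun) P.Mahomes to KC 49 for -5 yards. FUMBLES, and recovers "
            "at KC 48. P.Mahomes pass incomplete deep middle [N.Bosa].")
aborted = ("(15:00) P.Mahomes FUMBLES (Aborted) at KC 17, recovered by KC-I.Pacheco "
           "at KC 15. I.Pacheco to KC 13 for -2 yards (N.Bosa).")
forced = ("(13:41) R.Rice right tackle to KC 37 for 3 yards (L.Ryan; D.Greenlaw). "
          "FUMBLES (L.Ryan), recovered by KC-Ju.Watson at KC 37. "
          "Ju.Watson to KC 39 for 2 yards (J.Brown).")

for description in (standard, aborted, forced):
    print(fumble_types(description))
```

Because `fumble_types` returns a list, a play with several fumbles in a row would produce several entries, which is one way the "fumble after a fumble" case could be represented.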

In [None]:
# Will turn this into a function eventually

# 1. NEED TO HANDLE FUMBLES
# 2. YARDAGE MAY BE OFF. ('PlayOutcome' gives yardage total gain from play. This includes penalties but I do not want added yardage from penalties)

for idx, value in df_2023_pass_sb['PlayOutcome'].items():
  play = df_2023_plays_sb.loc[idx, 'PlayDescription']  # idx is an index label, so use .loc rather than .iloc

  ################
  # Play details #
  ################

  # Play Type
  df_2023_pass_sb_detailed.loc[idx, 'PlayType'] = 'Pass'

  # TimeOnTheClock
  TimeOnTheClock = re.findall(time_on_clock_pattern, play)
  df_2023_pass_sb_detailed.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0][1:-1]

  #############
  #  OFFENSE  #
  #############

  # Formation
  Formation = re.findall(formation, play)
  if len(Formation) > 0:
    df_2023_pass_sb_detailed.loc[idx, 'Formation'] = Formation[0][1:-1]
  # Passer & Receiver
  Passer = re.findall(name_pattern, play)
  df_2023_pass_sb_detailed.loc[idx, 'Passer'] = Passer[0] # Quarterback
  Receiver = re.findall(receiver_name_pattern, play)
  if len(Receiver) > 0:
    df_2023_pass_sb_detailed.loc[idx, 'Receiver'] = Receiver[0][1:] # Receiver

  # Yardage and PassType
  # Yardage is set to 0 for 'incomplete' and 'no gain'
  # - Will change when pass is successful for gain
  df_2023_pass_sb_detailed.loc[idx, 'Yardage'] = 0
  if value.find('Incomplete') != -1:  # For Incomplete passes
    df_2023_pass_sb_detailed.loc[idx, 'PassType'] = 'Incomplete'
  elif value.find('No Gain') != -1: # For successful passes with no gain
    if play.find('short') != -1:
      df_2023_pass_sb_detailed.loc[idx, 'PassType'] = 'Short'
    elif play.find('deep') != -1:
      df_2023_pass_sb_detailed.loc[idx, 'PassType'] = 'Deep'
  else: # For successful passes
    if int(value.split()[0]) < 20:
      df_2023_pass_sb_detailed.loc[idx, 'PassType'] = 'Short'
    else:
      df_2023_pass_sb_detailed.loc[idx, 'PassType'] = 'Deep'
    # Yardage gained on play from successful pass (Is this true? Value will give total yards gained from a play and this includes penalties)
    df_2023_pass_sb_detailed.loc[idx, 'Yardage'] = int(value.split()[0])

  # Pass Direction
  if play.find('left') != -1:
    df_2023_pass_sb_detailed.loc[idx, 'Direction'] = 'Left'
  elif play.find('right') != -1:
    df_2023_pass_sb_detailed.loc[idx, 'Direction'] = 'Right'
  elif play.find('middle') != -1:
    df_2023_pass_sb_detailed.loc[idx, 'Direction'] = 'Middle'

  #############
  #  DEFENSE  #
  #############

  tackler_1 = re.findall(defense_tackler_1_name_pattern, play) # tackler #1 (Could be solo or the one who initiated the hit)
  if len(tackler_1) > 0:
    df_2023_pass_sb_detailed.loc[idx, 'TackleBy1'] = tackler_1[0][1:]
  tackler_2 = re.findall(defense_tackler_2_name_pattern, play) # tackler #2 (equally contributed or assisted with tackle)
  if len(tackler_2) > 0:
    df_2023_pass_sb_detailed.loc[idx, 'TackleBy2'] = tackler_2[0][1:-1]
  pressure = re.findall(defense_pressure_name_pattern, play)   # Player who applied pressure to passer
  if len(pressure) > 0:
    df_2023_pass_sb_detailed.loc[idx, 'PressureBy'] = pressure[0][1:-1]

  #############
  #  PENALTY  #
  #############

  if play.find('Penalty') != -1 or play.find('PENALTY') != -1:
    # Splitting the description by sentences.
    # 1st sentence: Contains the play ran
    # 2nd sentence -> ?: Penalty breakdown (who, what, how many yards, accept/decline)
    penalty_play_elements = play.split(". ")
    # There may be more than 1 penalty, so they will all go within a list
    penalties = []
    for i in penalty_play_elements[1::]:
      penalty_breakdown = []
      # penalty breakdowns have key details separated by commas
      penalty = i.split(", ")
      # Which player caused penalty is the first key detail
      penalty_called_on = re.findall(team_identified_name, i)
      penalty_breakdown.append(penalty_called_on[0][1:])
      # Adding the rest of the penalty details
      # 1. what the penalty was
      # 2. the yardage given based off penalty
      # 3. enforced or denied
      for j in penalty[1::]:
        penalty_breakdown.append(j)
      penalties.append(penalty_breakdown)
    df_2023_pass_sb_detailed.at[idx, 'PenaltyDescription'] = penalties

  ##########
  # INJURY #
  ##########

  if play.find('injured') != -1:
    injured_name = re.findall(team_identified_name, play)
    df_2023_pass_sb_detailed.at[idx, 'InjuredPlayers'] = [x[1:] for x in injured_name]
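
Note that the loop above labels a completed pass 'Short' or 'Deep' by whether the yardage gained is below 20, so a short throw followed by a long run after the catch ends up recorded as 'Deep'. The description itself states the depth and direction right after the word "pass", so reading them from there may be more faithful. The sketch below assumes that wording ("pass [incomplete] short|deep left|middle|right") holds for every pass description, which has only been checked against the examples in this notebook; `pass_depth_and_direction` is a hypothetical helper.

```python
import re

# Depth and direction follow "pass", optionally with "incomplete" in between.
PASS_PATTERN = re.compile(r"pass (?:incomplete )?(short|deep) (left|middle|right)")


def pass_depth_and_direction(play):
    """Return (PassType, Direction) read from the description, or (None, None)."""
    match = PASS_PATTERN.search(play)
    if match is None:
        return None, None
    depth, direction = match.groups()
    return depth.capitalize(), direction.capitalize()


complete = "(13:01) (Shotgun) P.Mahomes pass deep right to M.Hardman to SF 9 for 52 yards (J.Brown)."
incomplete = "(12:31) (Shotgun) P.Mahomes pass incomplete deep left to M.Valdes-Scantling."

print(pass_depth_and_direction(complete))
print(pass_depth_and_direction(incomplete))
```

Anchoring on the word "pass" also avoids matching 'left' or 'right' elsewhere in the description, which a plain `play.find('left')` could do.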

In [None]:
df_2023_pass_sb_detailed[["PlayDescription", "PlayType", "TimeOnTheClock", "Formation", "Passer", "Receiver", "PassType", "Direction", "Yardage",
                          "TackleBy1", "TackleBy2", "PressureBy",
                          "InjuredPlayers", "PenaltyDescription"]]

Unnamed: 0,PlayDescription,PlayType,TimeOnTheClock,Formation,Passer,Receiver,PassType,Direction,Yardage,TackleBy1,TackleBy2,PressureBy,InjuredPlayers,PenaltyDescription
1,(11:39) (Shotgun) P.Mahomes pass short left to...,Pass,11:39,Shotgun,P.Mahomes,T.Kelce,Short,Left,1.0,C.Young,D.Greenlaw,,,
2,(11:04) (Shotgun) P.Mahomes pass short right t...,Pass,11:04,Shotgun,P.Mahomes,J.McKinnon,Short,Right,8.0,F.Warner,D.Greenlaw,,,
6,(5:15) (Shotgun) P.Mahomes pass short left to ...,Pass,5:15,Shotgun,P.Mahomes,R.Rice,Short,Left,0.0,F.Warner,,,,
10,(14:48) P.Mahomes pass short left to I.Pacheco...,Pass,14:48,,P.Mahomes,I.Pacheco,Short,Left,7.0,T.Gipson,,,,
13,(13:01) (Shotgun) P.Mahomes pass deep right to...,Pass,13:01,Shotgun,P.Mahomes,M.Hardman,Deep,Right,52.0,J.Brown,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
177,(14:55) (Shotgun) B.Purdy pass short middle to...,Pass,14:55,Shotgun,B.Purdy,C.McCaffrey,Short,Middle,2.0,C.Jones,,,,
181,(12:38) (Shotgun) B.Purdy pass short left to B...,Pass,12:38,Shotgun,B.Purdy,B.Aiyuk,Short,Left,11.0,M.Edwards,,,,
183,(11:12) (Shotgun) B.Purdy pass short left to C...,Pass,11:12,Shotgun,B.Purdy,C.McCaffrey,Deep,Left,24.0,L.Sneed,,G.Karlaftis,,
186,(9:25) (Shotgun) B.Purdy pass short right to K...,Pass,9:25,Shotgun,B.Purdy,K.Juszczyk,Short,Right,13.0,,,,,


In [None]:
focused_row = df_2023_pass_sb_detailed.loc[df_2023_pass_sb_detailed.index == 98]

In [None]:
focused_row[["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "PassType", "Direction", "Yardage",
               "TackleBy1", "TackleBy2", "PressureBy", "ForcedFumbleBy",
               "AfterFumble",
               "InjuredPlayers", "PenaltyDescription"]]

Unnamed: 0,PlayType,TimeOnTheClock,Formation,Passer,Rusher,Receiver,PassType,Direction,Yardage,TackleBy1,TackleBy2,PressureBy,ForcedFumbleBy,AfterFumble,InjuredPlayers,PenaltyDescription
98,Pass,9:22,Shotgun,P.Mahomes,,,Incomplete,Middle,0.0,,,N.Bosa,,,,


In [None]:
df_2023_pass_sb_detailed.loc[df_2023_pass_sb_detailed.index == 98]

## RUN

### IDENTIFYING DIFFERENT RUN PLAYS
- This section is used to categorize different run plays to see if they have to be handled differently.
  - Eventually, each category of run play will break down into the same set of features. The question here is how does each category break down to fall into these common features?

GOAL: To create a single method that will handle each type of run play and break them down to a common set of features.

In [None]:
df_2023_run_sb['PlayOutcome'].unique()

array(['-3 Yard Run', '10 Yard Run', '4 Yard Run', '2 Yard Run',
       '5 Yard Run', '3 Yard Run', 'Run for No Gain', '6 Yard Run',
       '11 Yard Run', '-2 Yard Run', '24 Yard Run', '1 Yard Run',
       '-1 Yard Run', '8 Yard Run', '22 Yard Run', '9 Yard Run',
       '19 Yard Run', '-4 Yard Run', '7 Yard Run', '16 Yard Run'],
      dtype=object)

In [None]:
# 2 different formats ?

# Taking these one at a time
# Currently on 'df_yard_runs'

# Is there a way to automate this process?

df_yard_runs = df_2023_run_sb[df_2023_run_sb['PlayOutcome'].str.contains('Yard Run')]
df_run_for_no_gain = df_2023_run_sb[df_2023_run_sb['PlayOutcome'].str.contains('Run for No Gain')]

In [None]:
#############################
# COPY WITH THE NEW COLUMNS #
#############################

df_2023_run_sb_detailed = df_2023_run_sb.copy()
df_2023_run_sb_detailed = df_2023_run_sb_detailed.reindex(columns=df_2023_run_sb_detailed.columns.tolist() + new_columns)
df_2023_run_sb_detailed[string_columns] = df_2023_run_sb_detailed[string_columns].astype(str)
df_2023_run_sb_detailed[int_columns] = df_2023_run_sb_detailed[int_columns].astype(float)

### RUN BREAKDOWN

In [None]:
print_plays(df_2023_plays_sb, df_2023_run_sb)

index:0
(12:15) (Shotgun) I.Pacheco left guard to KC 24 for -3 yards (N.Bosa, J.Hargrave).
-3 Yard Run

index:4
(6:28) (Shotgun) I.Pacheco right guard to KC 21 for 10 yards (L.Ryan; J.Brown).
10 Yard Run

index:7
(4:36) (Shotgun) P.Mahomes scrambles up the middle to KC 21 for 4 yards (R.Gregory).
4 Yard Run

index:11
(14:15) (Shotgun) I.Pacheco right guard to KC 34 for 2 yards (K.Givens).
2 Yard Run

index:12
(13:41) R.Rice right tackle to KC 37 for 3 yards (L.Ryan; D.Greenlaw). FUMBLES (L.Ryan), recovered by KC-Ju.Watson at KC 37. Ju.Watson to KC 39 for 2 yards (J.Brown).
5 Yard Run

index:16
(9:07) (Shotgun) I.Pacheco up the middle to KC 14 for 4 yards (F.Warner; J.Kinlaw).
4 Yard Run

index:17
(8:35) (Shotgun) P.Mahomes scrambles right tackle to KC 17 for 3 yards (A.Armstead).
3 Yard Run

index:20
(15:00) P.Mahomes FUMBLES (Aborted) at KC 17, recovered by KC-I.Pacheco at KC 15. I.Pacheco to KC 13 for -2 yards (N.Bosa).
Run for No Gain

index:25
(11:46) I.Pacheco up the middle to KC 

In [None]:
# All run plays

for idx, value in df_2023_run_sb['PlayOutcome'].items():
  play = df_2023_plays_sb.loc[idx, 'PlayDescription']  # idx is an index label, so use .loc rather than .iloc

  # Yardage from play. (grabbed from play outcome)
  # - Will be rewritten if there is a fumble or penalty
  if value == "Run for No Gain":
    df_2023_run_sb_detailed.loc[idx, 'Yardage'] = 0
  else:
    df_2023_run_sb_detailed.loc[idx, 'Yardage'] = int(value.split()[0])

  ##########################
  # MISTAKES AND TURNOVERS #
  ##########################

  # Fumble (Two types have been encountered. "Aborted" and "Force Fumble")
  if play.find('FUMBLES') != -1:
    # Splitting play by sentences
    fumble_play_elements = play.split(". ")
    # I have seen 2 possible sentences that start fumbled plays
    # 1. The intended play
    # 2. An aborted fumble (meaning a mistake happened, such as an off snap)
    play = fumble_play_elements[0] # Updated play description

    # Yardage from intended running play
    # - Will not receive yardage from an aborted fumble
    yardage = re.findall(manual_yardage, play)
    if len(yardage) > 0:
      df_2023_run_sb_detailed.loc[idx, 'Yardage'] = int(yardage[0].split()[0])
      print(int(yardage[0].split()[0]))

    # The rest of the sentences will either be:
    # 1. Who caused the fumble and who recovered (forced fumble)
    # 2. Who recovered and yardage after (Aborted fumble)
    # - The reason for the loop is just in case there is more than 1 fumble on the play
    for i in fumble_play_elements[1::]:
      # 1. Who caused the fumble and who recovered (Will only grab who caused fumble)
      if i.find('FUMBLES') != -1:
        # Player who forced fumble
        player_forced_fumble = re.findall(defense_tackler_1_name_pattern, i)
        if len(player_forced_fumble) > 0:  # no defender is named when the fumble was not forced
          df_2023_run_sb_detailed.loc[idx, 'ForcedFumbleBy'] = player_forced_fumble[0][1:]

      # 2. Who recovered and yardage after
      else:
        # list element will contain [who recovered, yardage, who tackled]
        action_after_fumble = []
        # Who Recovered
        player_running_after_fumble = re.findall(rusher_pattern, i)
        action_after_fumble.append(player_running_after_fumble[0].strip()) # rusher_pattern captures a trailing space
        # Yardage gained
        yardage_gained_after_fumble = re.findall(manual_yardage, i)
        action_after_fumble.append(yardage_gained_after_fumble[0])
        # Player who tackled
        tackler = re.findall(defense_tackler_1_name_pattern, i)
        action_after_fumble.append(tackler[0][1:])
        ################# I NEED TO ADD THIS TO A LIST JUST IN CASE OF MULTIPLE FUMBLES.
        df_2023_run_sb_detailed.at[idx, 'AfterFumble'] = action_after_fumble

    if play.find('(Aborted)') != -1:
      # Play Type
      df_2023_run_sb_detailed.loc[idx, 'PlayType'] = 'Run'
      # TimeOnTheClock
      TimeOnTheClock = re.findall(time_on_clock_pattern, play)
      df_2023_run_sb_detailed.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0][1:-1]
      # Formation
      Formation = re.findall(formation, play)
      if len(Formation) > 0 and Formation[0][1:-1] != "Aborted":
        df_2023_run_sb_detailed.loc[idx, 'Formation'] = Formation[0][1:-1]
      # Rusher
      rusher_names = re.findall(rusher_pattern, play) # May grab name(s) bc regular expression. (Only want rusher)
      rusher_name = rusher_names[0][:-1]
      # Fumble by mistake
      df_2023_run_sb_detailed.loc[idx, 'ForcedFumbleBy'] = "Aborted"
      continue

  # Penalty
  if play.find('PENALTY') != -1:
    penalty_play_elements = play.split(". ")
    play = penalty_play_elements[0]
    # Yardage from intended rusher alone
    yardage = re.findall(manual_yardage, play)
    df_2023_run_sb_detailed.loc[idx, 'Yardage'] = int(yardage[0].split()[0])

    for i in penalty_play_elements[1:]:
      penalty_breakdown = []
      penalty = i.split(", ")
      # Player
      penalty_called_on = re.findall(team_identified_name, i)
      penalty_breakdown.append(penalty_called_on[0][1:])
      # Penalty
      penalty_breakdown.append(penalty[1])
      # Yardage from penalty
      penalty_breakdown.append(penalty[2])
      df_2023_run_sb_detailed.at[idx, 'PenaltyDescription'] = penalty_breakdown

  ################
  # Play details #
  ################

  # Play Type
  df_2023_run_sb_detailed.loc[idx, 'PlayType'] = 'Run'

  # TimeOnTheClock
  TimeOnTheClock = re.findall(time_on_clock_pattern, play)
  df_2023_run_sb_detailed.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0][1:-1]

  #############
  #  OFFENSE  #
  #############

  # Formation
  Formation = re.findall(formation, play)
  if len(Formation) > 0:
    df_2023_run_sb_detailed.loc[idx, 'Formation'] = Formation[0][1:-1]
  # Rusher: findall may return several names; the first match is the rusher.
  rusher_names = re.findall(rusher_pattern, play)
  rusher_name = rusher_names[0][:-1]
  df_2023_run_sb_detailed.loc[idx, 'Rusher'] = rusher_name

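  # Direction: the text between the rusher's name and the direction keyword
  # (e.g. 'left guard', 'scrambles up the middle').
  # If several keywords appear in the play, the last one in this list wins.
  rushing_directions = ['guard', 'middle', 'tackle', 'end', 'kneels']
  for i in rushing_directions:
    if play.find(i) != -1:
      start = play.find(rusher_name) + len(rusher_name) + 1
      end = play.find(i) + len(i)
      df_2023_run_sb_detailed.loc[idx, 'Direction'] = play[start:end]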

  #############
  #  DEFENSE  #
  #############

  tackler_1 = re.findall(defense_tackler_1_name_pattern, play) # tackler #1 (Could be solo or the one who initiated the hit)
  if len(tackler_1) > 0:
    df_2023_run_sb_detailed.loc[idx, 'TackleBy1'] = tackler_1[0][1:]
  tackler_2 = re.findall(defense_tackler_2_name_pattern, play) # tackler #2 (equally contributed or assisted with tackle)
  if len(tackler_2) > 0:
    df_2023_run_sb_detailed.loc[idx, 'TackleBy2'] = tackler_2[0][1:-1]

3
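
Several lookups in the parsing cell index `[0]` on the result of `re.findall` without checking that anything matched, so an unexpected play format would surface as a bare `IndexError`. A minimal sketch of a helper that raises a descriptive error instead is shown below. `first_match`, `parse_time_on_clock`, and `TIME_PATTERN` are hypothetical names for illustration; the notebook's real `time_on_clock_pattern` is defined in an earlier cell and may differ.

```python
import re

# Hypothetical pattern for illustration only; the notebook's real
# time_on_clock_pattern is defined in an earlier cell.
TIME_PATTERN = re.compile(r"\(\d{1,2}:\d{2}\)")


def first_match(pattern, text):
    """Return the first match of `pattern` in `text`, or None if there is none."""
    matches = re.findall(pattern, text)
    return matches[0] if matches else None


def parse_time_on_clock(play):
    """Return the clock string without parentheses, or raise if it is missing."""
    raw = first_match(TIME_PATTERN, play)
    if raw is None:
        # Name the offending play so unhandled formats are easy to find.
        raise ValueError(f"Could not find time on clock in play: {play!r}")
    return raw[1:-1]


print(parse_time_on_clock("(13:41) R.Rice right tackle to KC 37 for 3 yards (L.Ryan; D.Greenlaw)."))
```

The same `first_match` check could wrap the rusher, yardage, and tackler lookups, so that a play the parser does not understand fails with a message that includes the play text.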


In [None]:
df_2023_run_sb_detailed[['PlayDescription', 'TimeOnTheClock', 'Formation', 'Rusher', 'PlayType', 'Direction', 'Yardage', 'TackleBy1', 'TackleBy2', 'ForcedFumbleBy', 'AfterFumble', 'PenaltyDescription']]

Unnamed: 0,PlayDescription,TimeOnTheClock,Formation,Rusher,PlayType,Direction,Yardage,TackleBy1,TackleBy2,ForcedFumbleBy,AfterFumble,PenaltyDescription
0,(12:15) (Shotgun) I.Pacheco left guard to KC 2...,12:15,Shotgun,I.Pacheco,Run,left guard,-3.0,N.Bosa,J.Hargrave,,,
4,(6:28) (Shotgun) I.Pacheco right guard to KC 2...,6:28,Shotgun,I.Pacheco,Run,right guard,10.0,L.Ryan,J.Brown,,,
7,(4:36) (Shotgun) P.Mahomes scrambles up the mi...,4:36,Shotgun,P.Mahomes,Run,scrambles up the middle,4.0,R.Gregory,,,,
11,(14:15) (Shotgun) I.Pacheco right guard to KC ...,14:15,Shotgun,I.Pacheco,Run,right guard,2.0,K.Givens,,,,
12,(13:41) R.Rice right tackle to KC 37 for 3 yar...,13:41,,R.Rice,Run,right tackle,3.0,L.Ryan,D.Greenlaw,L.Ryan,"[Ju.Watson , 2 yards, J.Brown]",
16,(9:07) (Shotgun) I.Pacheco up the middle to KC...,9:07,Shotgun,I.Pacheco,Run,up the middle,4.0,F.Warner,J.Kinlaw,,,
17,(8:35) (Shotgun) P.Mahomes scrambles right tac...,8:35,Shotgun,P.Mahomes,Run,scrambles right tackle,3.0,A.Armstead,,,,
20,"(15:00) P.Mahomes FUMBLES (Aborted) at KC 17, ...",15:00,,,Run,,0.0,,,Aborted,"[I.Pacheco , 2 yards, N.Bosa]",
25,(11:46) I.Pacheco up the middle to KC 11 for n...,11:46,,I.Pacheco,Run,up the middle,0.0,J.Kinlaw,,,,
27,(3:59) (Shotgun) I.Pacheco up the middle to KC...,3:59,Shotgun,I.Pacheco,Run,up the middle,3.0,D.Flannigan-Fowles,S.Joseph,,,
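
`PenaltyDescription` in the table above holds one list per play. The penalty loop in the parsing cell assigns to that cell on every iteration, so a play with more than one penalty would keep only the last one. A minimal sketch of collecting every penalty instead follows. `TEAM_PLAYER`, `parse_penalties`, and the sample play text (with made-up teams and players) are hypothetical; the notebook's real `team_identified_name` pattern may differ.

```python
import re

# Hypothetical stand-in for the notebook's team_identified_name pattern:
# a dash followed by a name such as "Z.Blocker".
TEAM_PLAYER = re.compile(r"-([A-Z][A-Za-z]*\.[A-Za-z'\-]+)")


def parse_penalties(sentences):
    """Return one [player, penalty, yardage] list per penalty sentence."""
    penalties = []
    for sentence in sentences:
        if "PENALTY" not in sentence:
            continue  # skip sentences that do not describe a penalty
        parts = sentence.split(", ")
        player = TEAM_PLAYER.search(sentence)
        if player is None or len(parts) < 3:
            continue  # unexpected layout; leave it for manual review
        penalties.append([player.group(1), parts[1], parts[2]])
    return penalties


# Made-up play text in the same sentence layout the parsing cell expects.
sample = ("(10:02) X.Runner left end to AA 30 for 5 yards (Y.Tackler). "
          "PENALTY on AA-Z.Blocker, Offensive Holding, 10 yards, enforced at AA 25 - No Play.")
print(parse_penalties(sample.split(". ")[1:]))
```

Storing the returned list of lists with `.at[idx, 'PenaltyDescription']` would keep every penalty on the play rather than only the final one.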


In [None]:
df_2023_run_sb_detailed['Rusher'].unique()

array(['I.Pacheco', 'P.Mahomes', 'R.Rice', 'nan', 'C.McCaffrey',
       'D.Samuel', 'B.Purdy', 'C.Edwards-Helaire', 'E.Mitchell',
       'K.Juszczyk'], dtype=object)
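
The quoted `'nan'` in the array above appears to be the literal string `'nan'` rather than a real missing value, which means `isna()` would not flag it. A minimal sketch of converting it to a true missing value and listing the rows without a rusher, using a small toy frame in place of `df_2023_run_sb_detailed` (the toy values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_2023_run_sb_detailed; values are illustrative.
toy = pd.DataFrame({
    "Rusher": ["I.Pacheco", "nan", "R.Rice"],
    "ForcedFumbleBy": ["nan", "Aborted", "L.Ryan"],
})

# Turn the literal string 'nan' into a real missing value in every column.
cleaned = toy.replace("nan", np.nan)

# Rows where no rusher was extracted, e.g. an aborted play.
no_rusher = cleaned[cleaned["Rusher"].isna()]
print(no_rusher)
```

Listing these rows is one way to spot plays that the parser did not fully handle.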

In [None]:
df_2023_run_sb_detailed['PlayDescription'].loc[df_2023_run_sb_detailed['Rusher'] == 'R.Rice'].iloc[0]

'(13:41) R.Rice right tackle to KC 37 for 3 yards (L.Ryan; D.Greenlaw). FUMBLES (L.Ryan), recovered by KC-Ju.Watson at KC 37. Ju.Watson to KC 39 for 2 yards (J.Brown).'
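
The parsing cell notes that post-fumble actions should go into a list in case a play has multiple fumbles. A minimal sketch of that idea, run on the play description printed above, is shown below. The three patterns and `collect_after_fumble` are hypothetical stand-ins, not the notebook's own `rusher_pattern`, `manual_yardage`, or tackler patterns, and the sketch only covers the sentence layout seen in this one play.

```python
import re

# Hypothetical patterns for illustration only.
NAME = re.compile(r"[A-Z][A-Za-z]*\.[A-Za-z'\-]+")
YARDS = re.compile(r"for (-?\d+) yards?")
TACKLER = re.compile(r"\(([^;)]+)")


def collect_after_fumble(play):
    """Return [recoverer, yards, tackler] for every run that follows a fumble."""
    actions = []
    seen_fumble = False
    for segment in play.split(". "):
        if "FUMBLES" in segment:
            seen_fumble = True
            continue
        if not seen_fumble:
            continue  # the original rush, before any fumble
        runner = NAME.search(segment)
        yards = YARDS.search(segment)
        tackler = TACKLER.search(segment)
        if runner and yards:
            actions.append([runner.group(0), int(yards.group(1)),
                            tackler.group(1) if tackler else None])
    return actions


play = ("(13:41) R.Rice right tackle to KC 37 for 3 yards (L.Ryan; D.Greenlaw). "
        "FUMBLES (L.Ryan), recovered by KC-Ju.Watson at KC 37. "
        "Ju.Watson to KC 39 for 2 yards (J.Brown).")
print(collect_after_fumble(play))  # [['Ju.Watson', 2, 'J.Brown']]
```

Because the result is a list of lists, a second fumble and recovery on the same play would add another entry instead of replacing the first.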

In [None]:
df_2023_run_sb_detailed['PlayOutcome'].loc[df_2023_run_sb_detailed['Rusher'] == 'R.Rice'].iloc[0]

'5 Yard Run'

In [None]:
df_2023_run_sb_detailed['Yardage'].loc[df_2023_run_sb_detailed['Rusher'] == 'R.Rice'].iloc[0]

3.0
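
For this play the parsed `Yardage` (3) and the 2 yards in `AfterFumble` add up to the 5 yards in `PlayOutcome`, which suggests a consistency check for spotting plays that were parsed incorrectly. A minimal sketch, assuming `PlayOutcome` strings look like `'5 Yard Run'` and `AfterFumble` entries look like `[recoverer, 'N yards', tackler]`; the function names are hypothetical:

```python
import re


def outcome_yards(play_outcome):
    """Extract the leading integer from a PlayOutcome string such as '5 Yard Run'."""
    match = re.match(r"\s*(-?\d+)\s+Yard", play_outcome)
    return int(match.group(1)) if match else None


def after_fumble_yards(after_fumble):
    """Yardage from an AfterFumble entry shaped like [recoverer, 'N yards', tackler]."""
    if not after_fumble:
        return 0
    return int(after_fumble[1].split()[0])


def yardage_is_consistent(play_outcome, rusher_yards, after_fumble=None):
    """Compare PlayOutcome yardage with rusher yardage plus post-fumble yardage."""
    expected = outcome_yards(play_outcome)
    if expected is None:
        return None  # the outcome has no yardage to compare against
    return expected == int(rusher_yards) + after_fumble_yards(after_fumble)


print(yardage_is_consistent("5 Yard Run", 3.0, ["Ju.Watson ", "2 yards", "J.Brown"]))  # True
```

Rows where this returns `False` would be candidates for manual review once the parser runs on more than one game.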

In [None]:
df_2023_run_sb_detailed['Yardage'].loc[df_2023_run_sb_detailed['Rusher'] == 'C.Edwards-Helaire'].sum()

0.0