<a href="https://colab.research.google.com/github/KeoniM/NFL_Data_Cleaning/blob/main/NFL_Plays_Week2_2023_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PURPOSE:**
- Accurately clean a week's worth of play data
  - Season 2023 -> Week 2

**NOTE:**
- Version 2 differs from version 1 in the data being used. The core of the data is identical to the original, but NFL.com has changed how it formats the play data that is scraped and used here. Minor adjustments are needed to build the new version, and I also see a good opportunity to clean up the older version: make the code more readable, organized, and efficient.

**STILL NEED TO WORK ON**
1. Fumble helper method
2. Laterals
   - Plays that contain both a fumble and a lateral are going to be very tricky to clean.
3. Penalties

# MOUNTING AND IMPORTS

In [95]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [96]:
# Used to access personal google cloud services
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [97]:
# Imports

# Data manipulation
import pandas as pd

# Regular expressions
import re

# # Natural Language Toolkit (Used to find complete sentences)
# import nltk
# nltk.download('punkt')
# nltk.download('punkt_tab')
# from nltk.tokenize import sent_tokenize

import spacy
nlp = spacy.load("en_core_web_sm")

# Database access
from google.cloud import bigquery

# LOADING DATA (BigQuery)

In [98]:
# Client connect to bigquery project
client = bigquery.Client('nfl-data-430702')

## Season 2023 Week 2

In [99]:
# Grabbing all plays from the 2023 NFL season, Week 2
nfl_plays_week2_2023_query = """
                             SELECT *
                             FROM `nfl-data-430702.NFL_Scores_v2.NFL-Plays-Week2_2023`
                             """

# Running a dry-run query, which returns the number of bytes the real query would process
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query = client.query(nfl_plays_week2_2023_query, job_config=dry_run_config)
print("This query will process {} gigabytes.".format(dry_run_query.total_bytes_processed/10**9))

# Running query (Being mindful of the amount of data being grabbed)
# Will grab a maximum of a Gigabyte
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
safe_config_query = client.query(nfl_plays_week2_2023_query, job_config=safe_config)

This query will process 0.000645655 gigabytes.


In [100]:
# Putting data attained from query into a dataframe
week2_2023_plays = safe_config_query.to_dataframe()

In [101]:
week2_2023_plays.head()

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,PlayNumberInDrive,IsScoringPlay,PlayOutcome,PlayStart,PlayTimeFormation,PlayDescription
0,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,1,0,Kickoff from MIN 35,,Kickoff,— G.Joseph kicks 65 yards from MIN 35 to end z...
1,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,2,0,6 Yard Pass,1st & 10 at PHI 25,15:00 1st Shotgun,— J.Hurts pass short right to D.Smith to PHI 3...
2,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,3,0,-1 Yard Sack,2nd & 4 at PHI 31,14:27 1st Shotgun,— J.Hurts sacked at PHI 30 for -1 yards (sack ...
3,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,4,0,7 Yard Run,3rd & 5 at PHI 30,13:45 1st Shotgun,— J.Hurts scrambles right end pushed ob at PHI...
4,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,5,0,-1 Yard Pass,1st & 10 at PHI 37,13:10 1st Shotgun,— J.Hurts pass short left to D.Goedert to PHI ...


# CATEGORIZE PLAYS
- The goal here is to parse out the different values for 'PlayOutcome'
  - Here is where I will separate different types of plays
    - ( pass / run / kickoff / etc..)

In [102]:
# All play outcomes from the game
# - From here we can categorize and clean plays accordingly
week2_2023_plays['PlayOutcome'].unique()

array(['Kickoff from MIN 35', '6 Yard Pass', '-1 Yard Sack', '7 Yard Run',
       '-1 Yard Pass', '54 Yard Pass', '1 Yard Pass', '1 Yard Run',
       '2 Yard Run', 'Field Goal', 'Kickoff from PHI 35', '15 Yard Pass',
       'Pass Incomplete', 'Punt', '-5 Yard Penalty', '3 Yard Run',
       'Fumble', '4 Yard Run', '12 Yard Run', '-7 Yard Sack',
       'Interception', '0 Yard Run', '5 Yard Pass', '-3 Yard Run',
       'Field Goal No Good', '5 Yard Penalty', '9 Yard Pass',
       '-2 Yard Run', '3 Yard Pass', '24 Yard Pass', '5 Yard Run',
       '7 Yard Pass', 'Touchdown', 'Extra Point', '6 Yard Run',
       '8 Yard Run', '11 Yard Pass', 'Timeout', '13 Yard Pass',
       '4 Yard Pass', '18 Yard Pass', '18 Yard Run', '0 Yard Pass',
       '-5 Yard Pass', '-10 Yard Penalty', '2 Yard Pass', '9 Yard Run',
       '11 Yard Run', '8 Yard Pass', '-2 Yard Sack', '-12 Yard Sack',
       '23 Yard Pass', '22 Yard Pass', '14 Yard Pass', '12 Yard Pass',
       '43 Yard Run', '10 Yard Pass', '16 Yard Pa

In [103]:
# NOTES:
# - Currently, I am eyeballing all unique play outcomes to categorize them.
#   - This approach is not flexible because a play outcome can
#     arise that has not been seen yet.
#     - There may be more play outcomes when working on a full season,
#       let alone all seasons and future games.

# Play Types with complete cleaning methods (As far as this sample size goes)

# ~ OFFENSE ~
df_2023_pass_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Pass')]
df_2023_run_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Run')]
# ~ DEFENSE ~
df_2023_interception_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Interception')]
df_2023_sack_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Sack')]
# ~ SPECIAL TEAMS ~
df_2023_punt_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Punt')]
df_2023_kickoff_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Kickoff')]
# ~ SCORING ~
df_2023_touchdown_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Touchdown')]
df_2023_extrapoint_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Extra Point')]
df_2023_fieldgoal_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Field Goal')]
# df_2023_2pt_conversion_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('2PT Conversion')]
df_2023_2pt_conversion_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Conversion')]
# ~ OTHER ~
df_2023_fumble_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Fumble')]
df_2023_penalty_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Penalty')]
df_2023_turnover_on_downs_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Turnover on Downs')]
df_2023_timeout_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Timeout')]
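
The notes in the cell above point out that filtering on hand-picked keywords breaks down when an unseen outcome shows up. One possible way to make that visible is to give each outcome a single label and fall back to an explicit `'Uncategorized'` bucket. This is only a sketch: the names `PLAY_TYPE_KEYWORDS` and `categorize_outcome` and the `'Safety'` outcome are illustrative assumptions, not part of this notebook or its data.

```python
# Hypothetical keyword list mirroring the filters used in the cell above.
PLAY_TYPE_KEYWORDS = [
    'Pass', 'Run', 'Interception', 'Sack', 'Punt', 'Kickoff', 'Touchdown',
    'Extra Point', 'Field Goal', 'Conversion', 'Fumble', 'Penalty',
    'Turnover on Downs', 'Timeout',
]

def categorize_outcome(play_outcome):
    """Return the first keyword found in the outcome, or 'Uncategorized'."""
    for keyword in PLAY_TYPE_KEYWORDS:
        if keyword in play_outcome:
            return keyword
    return 'Uncategorized'

# 'Safety' stands in for an outcome that has not been seen yet (assumption).
labels = [categorize_outcome(o) for o in ['6 Yard Pass', 'Field Goal No Good', 'Safety']]
print(labels)
```

Applied to the real data, something like `week2_2023_plays['PlayOutcome'].map(categorize_outcome)` would produce one label per play, and any `'Uncategorized'` rows would flag outcomes that still need a cleaning method.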

## SANITY CHECK (All Plays Accounted for)
  - Once all plays have been categorized, will compare the sum of all plays in each category to the size of the original dataframe of plays.
    - Goal is to make sure the number of plays is the same.

In [104]:
# Categorized plays

plays_list = [df_2023_pass_week2,         # Offense
              df_2023_run_week2,
              df_2023_interception_week2, # Defense
              df_2023_sack_week2,
              df_2023_punt_week2,         # Special Teams
              df_2023_kickoff_week2,
              df_2023_touchdown_week2,    # Scoring
              df_2023_extrapoint_week2,
              df_2023_fieldgoal_week2,
              df_2023_2pt_conversion_week2,
              df_2023_fumble_week2,       # Other
              df_2023_penalty_week2,
              df_2023_turnover_on_downs_week2,
              df_2023_timeout_week2]

num_plays_categorized = 0

for plays in plays_list:
  num_plays_categorized = num_plays_categorized + len(plays)

num_plays_categorized == len(week2_2023_plays)

True
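
Comparing totals confirms that the counts add up, but a total can still match if one play is counted twice while another is missed. A stricter check, sketched below on a small made-up series, counts how many keywords each outcome matches and reports anything that matches zero or more than one. The `outcomes` values (including `'Safety'`) and the variable names are assumptions for illustration only.

```python
import pandas as pd

keywords = ['Pass', 'Run', 'Interception', 'Sack', 'Punt', 'Kickoff',
            'Touchdown', 'Extra Point', 'Field Goal', 'Conversion',
            'Fumble', 'Penalty', 'Turnover on Downs', 'Timeout']

# Made-up sample; in the notebook this would be week2_2023_plays['PlayOutcome'].
outcomes = pd.Series(['6 Yard Pass', 'Pass Incomplete', 'Punt',
                      'Field Goal No Good', 'Safety'])

# One boolean column per keyword, one row per play.
matches = pd.DataFrame({kw: outcomes.str.contains(kw, regex=False) for kw in keywords})
hits = matches.sum(axis=1)

uncategorized = outcomes[hits == 0]    # plays no category picked up
double_counted = outcomes[hits > 1]    # plays that landed in several categories

print(uncategorized.tolist())
print(double_counted.tolist())
```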

# PIPELINE
- ORDER
  1. Team Dictionary
    - Used to map team names with their acronyms
  2. Regular expressions
    - Used to find common patterns within raw data
  3. Transforming Data
    - So far, only label encoding
  4. Cleaning methods
    - Unique cleaning methods for each play type
  5. Main pipeline method
    - Control flow of cleaning methods

## 1. TEAM DICTIONARY

In [105]:
# KEY: Team name
# VALUE: Acronym of team

dict_teams = {
    'Cardinals': 'ARI', 'Falcons': 'ATL', 'Ravens': 'BAL', 'Bills': 'BUF', 'Panthers': 'CAR', 'Bears': 'CHI',
    'Bengals': 'CIN', 'Browns': 'CLE', 'Cowboys': 'DAL', 'Broncos': 'DEN', 'Lions': 'DET', 'Packers': 'GB',
    'Texans': 'HOU', 'Colts': 'IND', 'Jaguars': 'JAX', 'Chiefs': 'KC', 'Raiders': 'LV', 'Chargers': 'LAC',
    'Rams': 'LAR', 'Dolphins': 'MIA', 'Vikings': 'MIN', 'Patriots': 'NE', 'Saints': 'NO', 'Giants': 'NYG',
    'Jets': 'NYJ', 'Eagles': 'PHI', 'Steelers': 'PIT', '49ers': 'SF', 'Seahawks': 'SEA', 'Buccaneers': 'TB',
    'Titans': 'TEN', 'Commanders': 'WAS'
}

In [106]:
# KEY: Full Team name
# VALUE: Acronym of team

dict_teams_2 = {
    'Arizona Cardinals': 'ARI', 'Atlanta Falcons': 'ATL', 'Baltimore Ravens': 'BAL', 'Buffalo Bills': 'BUF', 'Carolina Panthers': 'CAR', 'Chicago Bears': 'CHI',
    'Cincinnati Bengals': 'CIN', 'Cleveland Browns': 'CLE', 'Dallas Cowboys': 'DAL', 'Denver Broncos': 'DEN', 'Detroit Lions': 'DET', 'Green Bay Packers': 'GB',
    'Houston Texans': 'HOU', 'Indianapolis Colts': 'IND', 'Jacksonville Jaguars': 'JAX', 'Kansas City Chiefs': 'KC', 'Las Vegas Raiders': 'LV', 'Los Angeles Chargers': 'LAC',
    'Los Angeles Rams': 'LAR', 'Miami Dolphins': 'MIA', 'Minnesota Vikings': 'MIN', 'New England Patriots': 'NE', 'New Orleans Saints': 'NO', 'New York Giants': 'NYG',
    'New York Jets': 'NYJ', 'Philadelphia Eagles': 'PHI', 'Pittsburgh Steelers': 'PIT', 'San Francisco 49ers': 'SF', 'Seattle Seahawks': 'SEA', 'Tampa Bay Buccaneers': 'TB',
    'Tennessee Titans': 'TEN', 'Washington Commanders': 'WAS'
}

In [107]:
# KEY: Acronym of team
# VALUE: Team name

dict_teams_3 = {
    'ARI': 'Arizona Cardinals', 'ATL': 'Atlanta Falcons', 'BAL': 'Baltimore Ravens', 'BUF': 'Buffalo Bills', 'CAR': 'Carolina Panthers', 'CHI': 'Chicago Bears',
    'CIN': 'Cincinnati Bengals', 'CLE': 'Cleveland Browns', 'DAL': 'Dallas Cowboys', 'DEN': 'Denver Broncos', 'DET': 'Detroit Lions', 'GB': 'Green Bay Packers',
    'HOU': 'Houston Texans', 'IND': 'Indianapolis Colts', 'JAX': 'Jacksonville Jaguars', 'KC': 'Kansas City Chiefs', 'LV': 'Las Vegas Raiders', 'LAC': 'Los Angeles Chargers',
    'LAR': 'Los Angeles Rams', 'MIA': 'Miami Dolphins', 'MIN': 'Minnesota Vikings', 'NE': 'New England Patriots', 'NO': 'New Orleans Saints', 'NYG': 'New York Giants',
    'NYJ': 'New York Jets', 'PHI': 'Philadelphia Eagles', 'PIT': 'Pittsburgh Steelers', 'SF': 'San Francisco 49ers', 'SEA': 'Seattle Seahawks', 'TB': 'Tampa Bay Buccaneers',
    'TEN': 'Tennessee Titans', 'WAS': 'Washington Commanders'
}
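
The three dictionaries above carry the same information in different directions, so they could be derived from a single source to keep them in sync. The sketch below uses a three-team subset and hypothetical variable names (`full_name_to_abbr`, `abbr_to_full_name`, `nickname_to_abbr`); it assumes every nickname is the last word of the full team name, which holds for the teams listed here.

```python
# Small subset of the full-name mapping, for illustration only.
full_name_to_abbr = {
    'Philadelphia Eagles': 'PHI',
    'Minnesota Vikings': 'MIN',
    'San Francisco 49ers': 'SF',
}

# Reverse direction: acronym -> full team name.
abbr_to_full_name = {abbr: name for name, abbr in full_name_to_abbr.items()}

# Nickname (last word of the full name) -> acronym.
nickname_to_abbr = {name.rsplit(' ', 1)[-1]: abbr
                    for name, abbr in full_name_to_abbr.items()}

print(abbr_to_full_name['SF'])
print(nickname_to_abbr['49ers'])
```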

## 2. REGULAR EXPRESSIONS

In [108]:
####################################################
# REGULAR EXPRESSIONS USED TO LOCATE SPECIFIC DATA #
####################################################

###########
# GENERAL #
###########

# Players name (Grabs every variation come across so far)
# - I need this to be able to grab 'A.St. Brown' & 'C.Edwards-Helaire' & 'L.Van Ness'
name_pattern = r"(?:[A-Z][a-z]{0,4}\.)+(?:[- ]?[A-Z][a-z]+)+"

spotting_pattern = "(?:([A-Z]+) )?(-?[0-9]+)"

# Injuries (Returns the player(s) who got injured during the play)
injury_pattern = f"[A-Z]+-({name_pattern}) was injured during the play"

# Touchdowns
touchdown_pattern = f"for ([0-9]+) yards?, TOUCHDOWN"

################
# PLAY DETAILS #
################

# Positioning at the end of the play
standard_play_end_pattern = "(?:to|at) (?:([A-Z]+) )?([0-9]+) for (no gain|-?[0-9]+)(?: yards?)?"

###########
# OFFENSE #
###########

# Passer (Player passing, Player spiking, Player who got sacked)
passer_name_pattern = f"({name_pattern}) (?:pass|spiked|sacked)"

# Pass play (Returns intended receiver and the direction of the pass)
receiver_pattern = f"(short|deep) (left|right|middle) (?:to|intended for) ({name_pattern})"

# Rushing play (Player running ball)
rusher_pattern = f"({name_pattern})(?: scrambles)? (?:(left|right|up|kneels)) (?:(the middle|guard|tackle|end))?"

# 2 Point Conversion (Pass attempt)
tp_conversion_pass_pattern = f"({name_pattern}) pass to ({name_pattern})"

# 2 Point Conversion (Rush attempt)
tp_conversion_rush_pattern = f"({name_pattern}) rushes (left|right|up) (the middle|guard|tackle|end)"

###########
# DEFENSE #
###########

# Tackles

# solo / sack
solo_tackle_pattern = rf"\(({name_pattern})\)"

# shared
shared_tackle_pattern = rf"\(({name_pattern}), ({name_pattern})\)"

# assisted
assisted_tackle_pattern = rf"\(({name_pattern}); ({name_pattern})\)"

# Pressure (Who applied pressure to passer)
# - I think it might be possible for multiple defenders to apply pressure to the passer.
defense_pressure_name_pattern = rf"\[({name_pattern})\]"

# Split sack (Players who equally received credit for sack)
split_sack_pattern = f"sack split by ({name_pattern}) and ({name_pattern})"

# Defense takeaway (takeaway for yardage)
# D.Hill pushed ob at 50 for 20 yards (J.Wills)
# J.Bates to ATL 49 for no gain (T.Marshall)
defensive_takeaway_run_pattern = f"({name_pattern}) (?:pushed ob at|ran ob at|to)(?: ([A-Z]+))? (-?[0-9]+) for (no gain|-?[0-9]+)(?: yards?)?" # yardage after fumble recovery & yardage after interception

defensive_takeaway_for_touchdown = f"({name_pattern}) for ([0-9]+) yards?, TOUCHDOWN"

# Interception (Player who intercepted pass)
interception_name_pattern = rf"INTERCEPTED by ({name_pattern})(?:[ \t]*(?:\({name_pattern}\)|\[{name_pattern}\]))* at ((?:[A-Z]+ )?[0-9]+)"


#################
# SPECIAL TEAMS #
#################

# Punting play (Who was the punter, How many yards the ball went, Who was the Longsnapper)
punting_pattern = f"({name_pattern}) punts (-?[0-9]+) yards? to(?: ((?:[A-Z]+ )?-?[0-9]+)| end zone), Center-({name_pattern})"

# Punt return resulting in fair catch
punt_fair_catch_pattern = f", fair catch by ({name_pattern})"

# Punt or kickoff downed by
# downed by PHI-S.Brown
kick_downed_by_pattern = f"downed by [A-Z]+-({name_pattern})"

# Kickoff play (Who was the kicker, How many yards the ball was kicked )
kickoff_pattern = f"({name_pattern}) kicks(?: onside)? (-?[0-9]+) yards from ((?:[A-Z]+ )?[0-9]+) to ((?:[A-Z]+ )?-?[0-9]+|end zone)"

# R.James MUFFS catch
# muffed catch during punt or kickoff return
muffed_catch_pattern = f"({name_pattern}) MUFFS catch"

# Field goal (Good OR No Good)
field_goal_pattern = f"({name_pattern}) (-?[0-9]+) yard field goal is (?:GOOD|No Good),(?: ([A-Za-z]+(?: [A-Za-z]+)*),)? Center-({name_pattern}), Holder-({name_pattern})."

# Field goal (Blocked)
# — C.McLaughlin 40 yard field goal is BLOCKED (R.Green), Center-Z.Triner, Holder-J.Camarda, recovered by TB-J.Camarda at 50.
field_goal_blocked_pattern = rf"({name_pattern}) (-?[0-9]+) yard field goal is BLOCKED \(({name_pattern})\), Center-({name_pattern}), Holder-({name_pattern}), (?:RECOVERED|recovered) by ([A-Z]+)-({name_pattern}) at ((?:[A-Z]+ )?[0-9]+)"

# Extra Point (Good OR No Good)
extra_point_pattern = f"({name_pattern}) extra point is (?:GOOD|No Good),(?: ([A-Za-z]+(?: [A-Za-z]+)*),)? Center-({name_pattern}), Holder-({name_pattern})."

## 3. TRANSFORMING DATA

In [109]:
# PURPOSE:
# - Take value for 'PlayTimeFormation' and split into 3 separate features.
#   1. GameClock (Will come about when renaming 'PlayTimeFormation')
#   2. Quarter (This feature already exists; the values within 'PlayTimeFormation' are more accurate and will replace the original values)
#   3. Formation

def playtimeformation_split(df_plays):

  df_plays_copy = df_plays.copy()

  new_columns = ['Formation']

  df_plays_copy = df_plays_copy.rename(columns = {'PlayTimeFormation': 'GameClock'})

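  df_plays_copy = df_plays_copy.reindex(columns=df_plays_copy.columns.tolist() + new_columns)
  # Use object dtype so string values can be assigned into the new (all-NaN) column
  df_plays_copy['Formation'] = df_plays_copy['Formation'].astype(object)

  # Splitting original feature 'PlayTimeFormation' (now known as 'GameClock')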
  for idx, play in df_plays_copy['GameClock'].items():
    value_elements = play.split(' ')
    # Some plays (e.g. Kickoff) will only have the formation as a value
    if len(value_elements) <= 1:
      df_plays_copy.at[idx, 'Formation'] = value_elements[0]
      df_plays_copy.at[idx, 'GameClock'] = ""
    else:
      df_plays_copy.at[idx, 'GameClock'] = value_elements[0]
      df_plays_copy.at[idx, 'Quarter'] = value_elements[1]
      df_plays_copy.at[idx, 'Formation'] = " ".join(value_elements[2::])

  # Transform values in 'Quarter' feature from string to integer (e.g. '1st Quarter' -> 1)
  dict_replace_quarter = {'1st Quarter': 1, '2nd Quarter': 2, '3rd Quarter': 3, '4th Quarter': 4,
                          '1st': 1, '2nd': 2, '3rd': 3, '4th': 4}

  # All overtime quarters will have the value 5 in their place
  df_plays_copy['Quarter'] = df_plays_copy['Quarter'].map(dict_replace_quarter).fillna(5).astype(int)

  return df_plays_copy

# PURPOSE:
# - Take value for 'PlayStart' and split into 2 separate features.
#   1. DownAndDistance (Will come about when renaming 'PlayStart')
#   2. FieldPosition (Start of play)

def playstart_split(df_plays):

  df_plays_copy = df_plays.copy()

  new_columns = ['FieldPosition']

  df_plays_copy = df_plays_copy.rename(columns = {'PlayStart': 'DownAndDistance'})

  df_plays_copy = df_plays_copy.reindex(columns=df_plays_copy.columns.tolist() + new_columns)

  df_plays_copy['FieldPosition'] = df_plays_copy['FieldPosition'].astype(str)

  # Splitting original feature 'PlayStart' (Now known as 'DownAndDistance')
  for idx, play in df_plays_copy['DownAndDistance'].items():
    # Some plays do not have a down and distance or field position and contain 'nan' values here;
    # this catches those plays and keeps going. (e.g. Kickoff / Extra Point / etc.)
    if pd.isna(play):
      continue
    else:
      value_elements = play.split(' at ')
      df_plays_copy.at[idx, 'DownAndDistance'] = value_elements[0]
      df_plays_copy.at[idx, 'FieldPosition'] = value_elements[1]

  return df_plays_copy

# PURPOSE:
# - Keep team names consistent
#   - A team name will always be represented by their acronym

def consistent_team_names(df_plays):

  df_plays_copy = df_plays.copy()

  df_plays_copy['AwayTeam'] = df_plays_copy['AwayTeam'].map(dict_teams)
  df_plays_copy['HomeTeam'] = df_plays_copy['HomeTeam'].map(dict_teams)
  df_plays_copy['TeamWithPossession'] = df_plays_copy['TeamWithPossession'].map(dict_teams_2)

  return df_plays_copy
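
The row-by-row loop in `playstart_split` could also be written with pandas string methods. The sketch below is one possible vectorized version, not a drop-in replacement: the function name `playstart_split_vectorized` is hypothetical, it leaves missing values as NaN instead of the string `'nan'`, and it assumes at least one row contains `' at '` (otherwise the split produces only one column).

```python
import pandas as pd

def playstart_split_vectorized(df_plays):
    """Split 'PlayStart' into 'DownAndDistance' and 'FieldPosition'."""
    df_plays_copy = df_plays.copy()
    # Split once on ' at '; rows with a missing PlayStart stay missing in both parts.
    parts = df_plays_copy['PlayStart'].str.split(' at ', n=1, expand=True)
    df_plays_copy['DownAndDistance'] = parts[0]
    df_plays_copy['FieldPosition'] = parts[1]
    return df_plays_copy.drop(columns='PlayStart')

# Values in the same format as the 'PlayStart' column shown in the head() output.
sample_plays = pd.DataFrame({'PlayStart': ['1st & 10 at PHI 25', None, '3rd & 5 at PHI 30']})
result = playstart_split_vectorized(sample_plays)
print(result)
```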

## 4. CLEANING METHODS

### HELPER CLEANING METHODS

#### SPLIT PLAY DESCRIPTION INTO SENTENCES

In [110]:
# PURPOSE:
# - Function will split the feature "PlayDescription" into
# its individual sentences and place them in a list.

# - I am using playdescription.split(". ") to separate
#   sentences within play description. The problem here
#   is that sometimes a player will have ". " within their
#   name, causing a sentence to split into 2 with the
#   divide being in the middle of the players name. To
#   overcome this, I will replace the ". " character
#   combination within player names with the string
#   "<DOT>" then split play description into separate
#   sentences and revert the player names back to normal
#   after the split.

def split_play_description(play_description):

  # Finding all player names that were mentioned in the play
  player_names = re.findall(name_pattern, play_description)

  # Creating map for player names that have ". " within their name
  # and mapping them to a safe replacement name for the time being.
  replacements = {}
  for name in player_names:
    if ". " in name:
      protected_name = name.replace(". ", "<DOT>")
      replacements[name] = protected_name

  # Replacing original player name with safe replacement name in
  # play description
  for original, protected in replacements.items():
    play_description = play_description.replace(original, protected)

  # Splitting play description by ". "
  play_split = play_description.split(". ")

  # Revert player names back to normal in play_split
  restored_names = [s.replace("<DOT>", ". ") for s in play_split]

  return restored_names
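
The protect, split, and restore idea can also be expressed with a single `re.sub` callback. The sketch below is a compact variant intended to behave like the function above on the case shown; the function name `split_play_description_compact` and the sample description are assumptions made for illustration.

```python
import re

# Copied from the regular expressions section.
name_pattern = r"(?:[A-Z][a-z]{0,4}\.)+(?:[- ]?[A-Z][a-z]+)+"

def split_play_description_compact(play_description):
    # Protect ". " inside player names so it is not treated as a sentence break.
    protected = re.sub(name_pattern,
                       lambda m: m.group(0).replace(". ", "<DOT>"),
                       play_description)
    # Split into sentences, then restore the protected names.
    return [s.replace("<DOT>", ". ") for s in protected.split(". ")]

# Made-up description containing a name with ". " in it.
description = ("J.Goff pass short left to A.St. Brown to DET 40 for 8 yards (T.McDuffie). "
               "DET-A.St. Brown was injured during the play.")
sentences = split_play_description_compact(description)
for sentence in sentences:
    print(sentence)
```

Without the protection step, `description.split(". ")` would cut each "A.St. Brown" in two and return four fragments instead of two sentences.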

#### YARDAGE BETWEEN SPOTTINGS

In [111]:
# UPDATED YARDAGE BETWEEN SPOTTINGS METHOD.

# Within this dataset, a team scores by reaching the opposing team's endzone.
# - This does not alternate, they will never have to strive for their own
#   endzone to score.
#   - Their zone will always be 100-51 yards away from their target endzone
#   - Their opposing zone will always be 49-0 yards away from their target
#     endzone.

# GOAL:
# - For this method, I would like to calculate the distance between two field
#   positions (or spottings).

# PLAN:
# - INPUTS NEEDED:
#   1. Team with possession
#      - Need to be cautious when the ball is overturned
#        - (loss fumble, interception, punt return, kickoff return, etc..)
#          - Need to make sure that plays such as this have the feature
#            'TeamWithPossession' reflecting the team with the ball during that
#            time. During a single play, it is possible for both teams to
#            alternate possession of the ball.
#   2. start_spotting
#   3. end_spotting

# ALGORITHM:
# - IDEA:
#   - Both spottings (start and end) will be converted to numbers on a scale
#     from 0-100 depending on how far away the spotting is from the target
#     endzone. From there, I will subtract 'end_spotting' from 'start_spotting'
#     to get the yardage gained.
# 1. Compare team with possession to zone of spotting and convert spottings
#    - EXAMPLES
#      - TeamWithPossession = 'BUF'
#      - start_spotting = 'BUF 20'
#        - 'BUF 20' ->
#          - start_zone = 'BUF'
#          - start_yardage = 20
#        - TeamWithPossession == start_zone: (TRUE)
#          - converted_start_spotting = 100 - start_yardage (20)
#            -> converted_start_spotting = 80 (yards from target endzone)
#      - end_spotting = 'KC 35'
#        - 'KC 35' ->
#          - end_zone = 'KC'
#          - end_yardage = 35
#        - TeamWithPossession == end_zone: (FALSE)
#          - converted_end_spotting = end_yardage (35)
#            -> converted_end_spotting = 35 (yards from target endzone)
#      - yardage gained = 80 - 35 = 45

def yardage_between_spottings(team_with_possession, start_spotting, description_with_end_spotting):

  ##################
  # START SPOTTING #
  ##################

  # Breaking down 'start_spotting' ('BUF 20' -> [('BUF', '20')])
  start_elements = re.findall(spotting_pattern, start_spotting)
  start_territory = start_elements[0][0]
  start_yardage = int(start_elements[0][1])

  # converting start_yardage to 0-100 scale
  if start_territory == team_with_possession:
    start_yardage = 100 - start_yardage

  ##########################################
  # END SPOTTING AND PSEUDO YARDAGE GAINED #
  ##########################################

  # touchdown
  touchdown = re.findall(touchdown_pattern, description_with_end_spotting)
  if touchdown:
    # print(start_yardage)
    return start_yardage

  # Breaking down 'end_spotting'
  end_spotting = re.findall(standard_play_end_pattern, description_with_end_spotting)
  if end_spotting:
    end_territory = end_spotting[0][0]
    end_yardage = int(end_spotting[0][1])

    # converting end_yardage to 0-100 scale
    if end_territory == team_with_possession:
      end_yardage = 100 - end_yardage

  # PLAY FAILED (e.g. pass incomplete)
  else:
    # print(0)
    return 0

  # print(start_yardage - end_yardage)
  return start_yardage - end_yardage
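
The conversion described in the comments can be worked through in isolation. The sketch below reproduces the 'BUF 20' to 'KC 35' example with a small hypothetical helper, `yards_from_target_endzone`, which is not part of the notebook; it takes a bare spotting string such as 'BUF 20' or '50' and does not parse full play descriptions.

```python
import re

def yards_from_target_endzone(team_with_possession, spotting):
    """Convert a spotting like 'BUF 20' to yards remaining to the target endzone."""
    match = re.fullmatch(r"(?:([A-Z]+) )?(-?[0-9]+)", spotting)
    territory, yardage = match.group(1), int(match.group(2))
    # In its own territory the team still has the rest of the field to cover.
    if territory == team_with_possession:
        return 100 - yardage
    return yardage

start = yards_from_target_endzone('BUF', 'BUF 20')   # own territory: 100 - 20
end = yards_from_target_endzone('BUF', 'KC 35')      # opposing territory: 35
gained = start - end

# A loss: BUF is pushed back from its own 20 to its own 15.
lost = (yards_from_target_endzone('BUF', 'BUF 20')
        - yards_from_target_endzone('BUF', 'BUF 15'))

print(start, end, gained, lost)
```

The 50 yard line has no territory prefix, so it converts to 50 for either team.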

In [112]:
# # THIS METHOD IS STUUUPPPIIIIDDDD!!!
# # - AN EXAMPLE OF WHY IT IS IMPORTANT TO UNDERSTAND YOUR DATASET BEFORE
# #   TRYING TO CLEAN IT.
# #   - A team is only able to score in the opposing teams zone. It does not
# #     alternate (between quarters) and they have to aim for their zone for
# #     a touchdown.

# # PURPOSE:
# # - Calculate the yardage between two spottings



# # MOST BENEFICIAL WHERE:
# # 1. fumbled plays
# # 2. penalty plays

# # CONCERNS
# # 1. Should I only use this method for plays that absolutely need it?
# #    - This seems like it would be a lengthy process having to go through
# #      this method for each play.

# # WHAT I NEED
# # 1. start spotting
# # 2. end spotting
# # 3. direction to goal

# # FEATURES THAT COULD HELP:

# # - STRICTLY FOR DIRECTION
# #   1. dataframe of plays (NOT IMPLEMENTED IN FIRST ITERATION)
# #   2. play index (NOT IMPLEMENTED IN FIRST ITERATION)
# #      - might need to reference other plays in the drive or quarter
# #      DESIGN NOTE:
# #      - The index does not have to be the original from the dataframe of plays,
# #        the dataframe of plays does have to be original. I just need to be able
# #        to grab features from this play being looked at to reference other plays
# #        within the dataframe of plays.

# # - BREAD AND BUTTER (most will only need these)
# #   3. description of action within play (could be a slice of a single play)
# #      - This is where I will find the 'end spotting'.
# #      - Some plays will have multiple actions with different yardage gains in them.
# #        I need to pinpoint which action I am looking at specifically
# #   4. start spotting
# #      - Because of the multiple actions nature of some of these plays
# #        (fumbles / penalties) I will need to locate the start spotting before
# #        hand.
# #      DESIGN NOTE:
# #      - I may have to cycle regular expressions to find the correct end spotting

# # DESIGN MENTALITY:
# # - Iterate over time.

# def yardage_between_spottings(df_plays, play_index, start_spotting, description_with_end_spotting):

#   # DIRECTION
#   # - I need to figure out which zone is past the 50 and which zone is within the 50 for the team
#   #   with the ball. (e.g. 'BUF' is 100-51, 'KC' is 49-0, 50 is neutral)
#   #   - I will find this by looking at
#   #     1. start spotting
#   #     2. end spotting
#   #     3. yardage gained between
#   #        - Majority of play descriptions will have the 'end spotting' and
#   #          'yardage gained between'. These are essential and if they are not
#   #          located within the passed in 'description_with_end_spotting' then
#   #          that is when I will need to look at another play within this quarter.

#   # DESIGN
#   # - Every spotting will have both the zone and the yardage (e.g. 'BUF 20')
#   #   - I want all spottings to be on a 100 point scale to represent the length of
#   #     the field, the zone will aid in this.
#   #     - EXAMPLE:
#   #       - 'BUF 20'
#   #       - (BUF zone is 100-51)
#   #         - 100 - 20 = 80 yards to endzone
#   #       - (BUF zone is 49-0)
#   #         - 20 yard to endzone
#   #   - The reason for doing this is so that I will be able to tell, given 2 spottings,
#   #     whether it was a negative gain vs positive.
#   #     - EXAMPLE:
#   #       - start_spotting = BUF 20
#   #       - end_spotting   = BUF 30
#   #       - (BUF zone is 100-51)
#   #         - start_spotting = 100 - 20 = 80 yards until endzone
#   #         - end_spotting   = 100 - 30 = 70 yards until endzone
#   #           - yardage gained = 80 - 70 = 10 yards gained
#   #       - start_spotting = BUF 20
#   #       - end_spotting   = BUF 30
#   #       - (BUF zone is 49-0)
#   #         - start_spotting = 20 yards until endzone
#   #         - end_spotting   = 30 yards until endzone
#   #           - yardage gained = 20 - 30 = -10 yards gained

#   # LOCATE
#   # start_territory
#   # start_yardage
#   # end_territory
#   # end_yardage
#   # pseudo_play_yardage

#   ##################
#   # START SPOTTING #
#   ##################

#   # Splitting start_spotting (e.x. [['BUF'], ['20']])
#   start_elements = re.findall(spotting_pattern, start_spotting)
#   start_territory = start_elements[0][0]
#   start_yardage = int(start_elements[0][1])
#   # print(start_territory)
#   # print(start_yardage)

#   ##########################################
#   # END SPOTTING AND PSEUDO YARDAGE GAINED #
#   ##########################################
#   # - end_territory
#   # - end_yardage
#   # - pseudo_play_yardage

#   # (e.x. [['BUF'], ['30'], ['10']])
#   end_spotting_and_play_yardage = re.findall(standard_play_end_pattern, description_with_end_spotting)
#   # (e.x. ['5'])
#   touchdown = re.findall(touchdown_pattern, description_with_end_spotting)
#   # STANDARD
#   if end_spotting_and_play_yardage:
#     # Grabbing yardage from play description
#     # - Return immediately if no gain on play
#     if end_spotting_and_play_yardage[0][2] == 'no gain':
#       return 0
#     else:
#       end_territory = end_spotting_and_play_yardage[0][0]
#       end_yardage = int(end_spotting_and_play_yardage[0][1])
#       pseudo_play_yardage = int(end_spotting_and_play_yardage[0][2])
#   # TOUCHDOWN
#   elif touchdown:
#     pseudo_play_yardage = int(touchdown[0])
#     # Same zone touchdown
#     if int(touchdown[0]) < 50:
#       end_territory = start_territory
#     # Opposite zone touchdown
#     if int(touchdown[0]) > 50:
#       # touchdown was in opposite zone as start zone
#       if df_plays['HomeTeam'].loc[play_index] == start_territory:
#         end_territory = df_plays['AwayTeam'].loc[play_index]
#       else:
#         end_territory = df_plays['HomeTeam'].loc[play_index]
#     # 50 yard line
#     if int(touchdown[0]) == 50:
#       end_territory = None
#     end_yardage = 0
#   # PLAY FAILED (e.g. pass incomplete)
#   else:
#     return 0


#   #######################
#   # CALCULATING YARDAGE #
#   #######################

#   # PLAN ON HOW TO LOCATE ZONES
#   # 1. spotting_difference
#   #    - ( start_spotting - end_spotting )
#   # 2. pseudo_play_yardage
#   #    - Was the yardage recorded in the play
#   #      description positive or negative?

#   # SCHEMATIC? BLUEPRINT? I cant figure out the right word.
#   # Standard cases (start position and end position are in the same zone):
#   # spotting_difference (+) & pseudo_play_yardage (+):
#   # - the start position team zone (49-0)
#   # spotting_difference (-) & pseudo_play_yardage (+):
#   # - the start position team zone (100-51)
#   # spotting_difference (+) & pseudo_play_yardage (-):
#   # - the start position team zone (100-51)
#   # spotting_difference (-) & pseudo_play_yardage (-):
#   # - the start position team zone (49-0)

#   # Unique cases (start position and ending position are in different zones):
#   # zones switch (e.g. KC 47 -> BUF 47)
#   # pseudo_play_yardage (+):
#   # - the start position team zone (100-51)
#   # pseudo_play_yardage (-):
#   # - the start position team zone (49-0)

#   # Standard cases
#   if (start_territory == end_territory):
#     # spotting_difference (+)
#     if start_yardage > end_yardage:
#       # pseudo_play_yardage (+)
#       # starting position 49-0 zone
#       if pseudo_play_yardage > 0:
#         starting_position = start_yardage
#         ending_position = end_yardage
#       # pseudo_play_yardage (-)
#       # starting position 100-51
#       else:
#         starting_position = 100 - start_yardage
#         ending_position = 100 - end_yardage
#     # spotting_difference (-)
#     else:
#       # pseudo_play_yardage (+)
#       # starting position 100-51
#       if pseudo_play_yardage > 0:
#         starting_position = 100 - start_yardage
#         ending_position = 100 - end_yardage
#       # pseudo_play_yardage (-)
#       # starting position 49-0
#       else:
#         starting_position = start_yardage
#         ending_position = end_yardage
#   else:
#     # pseudo_play_yardage (+)
#     # starting position 100-51
#     if pseudo_play_yardage > 0:
#       starting_position = 100 - start_yardage
#       ending_position = end_yardage
#     # pseudo_play_yardage (-)
#     # starting position 49-0
#     else:
#       starting_position = start_yardage
#       ending_position = 100 - end_yardage


#   # # DESIGN CHECK. (Checking for accuracy)
#   # if pseudo_play_yardage != int(starting_position) - int(ending_position):
#   #   print(pseudo_play_yardage)
#   #   print(int(starting_position) - int(ending_position))
#   #   raise ValueError(f"Yardage mismatch at play_index {play_index}, \"{description_with_end_spotting}\"")

#   return int(starting_position) - int(ending_position)
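The zone rules above can also be expressed by converting each spotting to a distance from the opponent's goal line, given the team with possession. The sketch below is only an illustration under that assumption: `distance_to_goal` and `yards_between` are hypothetical names (not functions from this notebook), and spottings are assumed to look like `'KC 47'` or a bare `'50'` for midfield.

```python
def distance_to_goal(spotting, offense):
    """Yards between a spotting such as 'KC 47' and the opponent's goal line."""
    parts = spotting.split()
    yard_line = int(parts[-1])
    # A bare '50' has no territory; midfield is 50 yards from either goal line.
    if len(parts) == 1:
        return yard_line
    territory = parts[0]
    # Own territory is the 100-51 zone, opponent territory is the 49-0 zone.
    return 100 - yard_line if territory == offense else yard_line


def yards_between(start_spotting, end_spotting, offense):
    """Positive when the offense moved toward the opponent's goal line."""
    start = distance_to_goal(start_spotting, offense)
    end = distance_to_goal(end_spotting, offense)
    return start - end


print(yards_between('KC 47', 'BUF 47', 'KC'))  # zones switch (gain)
print(yards_between('KC 30', 'KC 25', 'KC'))   # same zone (loss)
```

Because the possession team decides the zone directly, this version does not need the sign of the yardage recorded in the play description.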

#### FUMBLES

In [168]:
# I think I want to have the elements within the list formatted as
# [start_spotting, play description with end_spotting, end_spotting]
# - (end_spotting)
#   - Sometimes the end spotting within the play description gets adjusted
#     based on what happens after the ball leaves the player's hand.
#     - EXAMPLES:
#       - player fumbles and same team player recovers fumble behind spotting
#         - end_spotting = spotting of recovery
#       - player fumbles and same team player recovers fumble beyond spotting
#         - end_spotting = fumble spotting
#       - player fumbles and opp team player recovers fumble
#         - end_spotting = fumble spotting

# PURPOSE:
# - A method that will clean fumbled plays for every play type

# INPUT:
# - Dataframe of plays & index of fumbled play

# OUTPUT:
# - Potential multi-row dataframe that will contain every action (possession)
#   that occurred during the play. (e.g. rush/fumble recovery for yards/etc..)

# DESIGN IDEA (STEP BY STEP):
# 1. Receive single row dataframe of fumble play
# 2. Split play description into separate sentences
# 3. Group split sentences in a way where each grouping contains all information
#    needed to complete a single row in the return dataframe.
#    - There are cases where a single sentence will have information that is
#      needed for multiple rows.
#      EXAMPLE (SENTENCE NEEDED FOR MULTIPLE ROWS):
#      - "FUMBLES (D.White) [D.White], RECOVERED by TB-C.Izien at CHI 48"
#        - This sentence contains:
#          1. Who forced the fumble ................ '(D.White)'
#          2. Who applied pressure..? .............. '[D.White]'
#          3. What team recovered the fumble ....... 'TB'
#          4. Who recovered the fumble ............. 'C.Izien'
#          5. The spotting of the fumble recovery .. 'CHI 48'
#          - The sentence before this one was some sort of action or play
#            (run/pass/etc..)
#            - ROW 1
#              - Will need:
#                1. Who forced the fumble ................ '(D.White)'
#                2. Who applied pressure..? .............. '[D.White]'
#                3. What team recovered the fumble ....... 'TB'
#                4. Who recovered the fumble ............. 'C.Izien'
#                5. The spotting of the fumble recovery .. 'CHI 48'
#          - The sentence after this could be a run after recovery
#            - ROW 2
#              - Will need:
#                5. The spotting of the fumble recovery .. 'CHI 48'
#                   - For the start spotting of the run after recovery
# 4. Clean grouped sentences into accurate and useable data
#    - Each grouping of sentences will be a row in the return dataframe
# 5. Return the new clean fumbled play dataframe

# THOUGHTS:
# 1 - I THINK that the initial player who fumbled and all players that follow
#     are cleaned differently.
#     - The yardage recorded by the initial player is a bit different than all
#       players that fumble and recover after.
#     - There is more than just this but this is all I can think of right now.
# 2 - Rules for how yardage is recorded:
#     - If a player fumbles and the person who recovers the fumble is:
#       1. On the same team
#          -> Yardage ends at the spotting of the recovery
#       2. On the opposing team
#          -> Yardage ends at the spotting of the fumble
#     - If a player catches a pass
#       -> fumbles
#          -> recovers own fumble
#             -> rushes for extra yards
#                - Yardage for this player is from the LOS -> down
#                - Yardage for the player who passed the ball is from
#                  LOS -> down (Same as receiving yards from receiver)
#                - Receiver is credited with a fumble
#     - If a fumble occurs behind the LOS
#       -> recovered by same team
#          -> recovered behind LOS
#             -> play is done
#                = player that fumbled receives (-) yards
#                  - Including all players who might have recovered and fumbled
#                    before final player who recovered behind LOS.
#             -> player rushes beyond LOS
#                = initial player receives 0 yards
#                  - Including all players who might have recovered and fumbled
#                    before final player who crossed LOS.
#        -> recovered beyond LOS
#           = initial player receives 0 yards? <- double check this
# 3 - Return dataframe format
#     - For each row of the return dataframe, somewhere within one of the
#       feature values, I need to include a fraction that will show what row
#       number the action is.
# 4 - I need a summary row for certain playtypes
#     - I forget for which ones, but I know I need it

def helper_clean_fumble_play(df_plays, play_index):

  print(play_index)

  ###################################################
  # 1. RECEIVE SINGLE ROW DATAFRAME OF FUMBLED PLAY #
  ###################################################

  # copy of original play row
  df_play = df_plays.loc[play_index].copy()

  #############################
  # 2. SPLIT PLAY DESCRIPTION #
  #############################

  play_description = df_play['PlayDescription']

  ############
  # REVERSES #
  ############
  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play_description.find('REVERSED') != -1:
    play_elements = split_play_description(play_description)
    for i in play_elements:
      if i.find("REVERSED") != -1:
        df_play['ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play_description = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  # Splitting 'PlayDescription' into a list of actions (sentences)
  play_description_split = split_play_description(play_description)

  play_description_length = "- 1. PLAY DESCRIPTION -"
  print("-" * len(play_description_length))
  print(play_description_length)
  print("-" * len(play_description_length))
  print(play_description)
  print()

  play_description_split_length = "- 2. PLAY DESCRIPTION SPLIT -"
  print("-" * len(play_description_split_length))
  print(play_description_split_length)
  print("-" * len(play_description_split_length))
  for i in play_description_split:
    print(i)
  print()

  ############################
  # 3. GROUP SPLIT SENTENCES #
  ############################

  # List of actions in play
  # - Will be a 2D list (list of lists)
  # - Every element within this list (each element is a list of its own) will
  #   have a grouping of sentences that will represent a single action within
  #   the play
  play_split = []

  # - I would like for each element in the list to have everything that it needs
  #   to complete a row.
  #   - This means that every element will have:
  #     1. Start spotting (LOS or recovery spotting)
  #     2. End spotting (down or fumble spotting)
  #     3. Player with ball
  #     4. Player who made tackle

  # - I will organize each of these sentences by checking each one,
  #   in chronological order, and determining whether the sentence:
  #   1. Is the start of a new row (primary action)
  #      - rushes/passes/etc..
  #   2. Is an essential addition to a row (secondary action)
  #       - fumble recoveries / etc..
  #       - These sentences could also start a new row
  #         - fumble recovery means a potential start to a new rush attempt
  #           - this is part of having all information grouped together for
  #             a single row

  # Cycling through every action (sentence) within play
  while (play_description_split):

    # If a fumble recovery for action happens
    action_after_fumble_recovery = False

    #####################
    # EXTRA INFORMATION #
    #####################
    # - Extra information that does not have to do with the play itself
    #   (e.g. injuries/penalties/eligibility/etc..)

    #####################
    # SECONDARY ACTIONS #
    #####################
    # - The reason why this is checked first is because of formatting. For
    #   fumble recoveries that result in an attempt to gain yards, I would like
    #   for that sentence to be the start of a new element because it will
    #   provide the start spotting for what comes next (rush/pass/etc..).

    # - Center fumbled
    #   - Could possibly take this sentence out completely, I do not think
    #     there is any useful information in it.
    #     - (e.g. "P.Mahomes Aborted.")
    #       - P.Mahomes is not at fault for the fumble, his center is.
    if ' aborted' in play_description_split[0].lower():
      play_split.append([play_description_split[0]])
      # - The reason for taking this out here is because it is not required
      #   for any other row grouping.
      play_description_split.pop(0)
      continue

    # - Sentence containing fumble description
    if 'fumble' in play_description_split[0].lower() or 'muffs' in play_description_split[0].lower():

      # - Quarterback fumbled
      if '(aborted)' in play_description_split[0].lower():
        play_split.append([play_description_split[0]])
      # - Fumbled while catching a punt or kickoff
      elif 'muffs' in play_description_split[0].lower():
        play_split.append([play_description_split[0]])
      else:

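        # - Added onto element in list with primary action
        #   (starts a new element if no primary action has been grouped yet)
        if play_split:
          play_split[-1].append(play_description_split[0])
        else:
          play_split.append([play_description_split[0]])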

      # - If the sentence is one that contains recovery information and is
      #   followed by more actions, it will be the start of a new row
      #   - This could be a problem in the future when there are penalties
      #     and injuries.
      if 'recover' in play_description_split[0].lower() and len(play_description_split) > 1:

        # 1. Only grabbing the part of the sentence that has the recovery
        #    information in it.
        #    EXAMPLE:
        #    - "— P.Mahomes Aborted', 'C.Humphrey FUMBLES at KC 39, touched at
        #         KC 38, recovered by KC-P.Mahomes at KC 40"
        #      - only want "recovered by KC-P.Mahomes at KC 40"
        # 2. Set recovery info as start of new grouping for new row
        # 3. Pop out recovery sentence, next sentence will be primary action
        #    - (e.g. pass/run/etc..)
        recover_info = play_description_split[0].split(", ")
        play_split.append([recover_info[len(recover_info) - 1]])
        play_description_split.pop(0)

        # - This is a fumble recovery for action, so the primary action will
        #   need to append itself to this newly created element within the list
        action_after_fumble_recovery = True

    ###################
    # PRIMARY ACTIONS #
    ###################
    # - Primary actions are sentences that have details about a player rushing
    #   or passing the ball.

    # - Sentence that contains information of an attempt to gain yards
    for play_pattern in [passer_name_pattern, standard_play_end_pattern]:
      if re.search(play_pattern, play_description_split[0]):
        # - Attempt to gain yards after a fumble recovery
        if action_after_fumble_recovery:
          play_split[-1].append(play_description_split[0])
        # - Initial attempt to gain yards (Probably before fumble occurred)
        else:
          play_split.append([play_description_split[0]])
        break

    # - Sentence should be taken care of, onto the next sentence in the play
    play_description_split.pop(0)

  play_description_grouped_length = "- 3. PLAYDESCRIPTION GROUPED -"
  print("-" * len(play_description_grouped_length))
  print(play_description_grouped_length)
  print("-" * len(play_description_grouped_length))
  for j in play_split:
    print(j)
  print()
  print()

  ##############################
  # 4. CLEAN GROUPED SENTENCES #
  ##############################

  return
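Step 4 has not been written yet. As one possible starting point, the fields listed in the design notes (who forced the fumble, who applied pressure, the recovering team and player, and the recovery spotting) could be pulled out of a fumble sentence with regular expressions. The patterns and the `parse_fumble_sentence` name below are assumptions written only against the two example sentences quoted in the comments above; real play descriptions may need broader patterns.

```python
import re

# Hypothetical patterns, written for the two example sentences in the notes.
FORCED_BY = re.compile(r"FUMBLES \(([^)]+)\)")
PRESSURE_BY = re.compile(r"\[([^\]]+)\]")
FUMBLE_SPOT = re.compile(r"FUMBLES[^,]*? at ((?:[A-Z]{2,3} )?\d{1,2})")
RECOVERY = re.compile(
    r"recovered by ([A-Z]{2,3})-(.+?) at ((?:[A-Z]{2,3} )?\d{1,2})",
    re.IGNORECASE,
)


def parse_fumble_sentence(sentence):
    """Return the fumble details found in one sentence (None when absent)."""
    forced = FORCED_BY.search(sentence)
    pressure = PRESSURE_BY.search(sentence)
    fumble_spot = FUMBLE_SPOT.search(sentence)
    recovery = RECOVERY.search(sentence)
    return {
        'ForcedBy': forced.group(1) if forced else None,
        'PressureBy': pressure.group(1) if pressure else None,
        'FumbleSpot': fumble_spot.group(1) if fumble_spot else None,
        'RecoveryTeam': recovery.group(1) if recovery else None,
        'RecoveredBy': recovery.group(2) if recovery else None,
        'RecoverySpot': recovery.group(3) if recovery else None,
    }


print(parse_fumble_sentence(
    "FUMBLES (D.White) [D.White], RECOVERED by TB-C.Izien at CHI 48"))
```

Returning a dictionary keeps each grouped sentence close to the shape of a dataframe row, so the result could later be assigned to columns of the return dataframe.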

#### LATERALS

In [164]:
# Lateral cleaning method

# NOTES AND THOUGHTS:
# - I think yardage gained here is similar to yardage gained during fumbles
#   - What I think the rules for yardage awarded are:
#     1. Yardage awarded = Start_Spotting -> End_Spotting
#        - Start spotting could be:
#          1. line of scrimmage
#          2. From where the ball was received during a lateral
#        - End spotting could be:
#          1. Tackled
#          2. Spotting of player lateralling ball
#          3. Run out of bounds
#          4. touchdown
#          5. Fumble
#             - Hopefully this will be taken care of when using a
#               main cleaning method.
#               - Exactly like fumbles, this lateral cleaning method will
#                 organize the play into segments (each segment representing a
#                 separate player in possession of the ball within the play
#                 doing whatever. But it will always be from the start of when
#                 they get the ball to the end.)
#                 - Each segment will be cleaned using a main cleaning method,
#                   most will be cleaned using the 'clean_run_plays' method but
#                   it can be whatever.
#                   - Which now that I think about it, I have not thought of
#                     other main cleaning methods to use other than the
#                     'clean_run_plays' method. In most cases, after a player
#                     fumbles, the one who picks it up will run with the ball.
#                     But this is not always the case. They could pass, punt,
#                     throw an interception, anything.
#                     - It might be a good idea to come up with a
#                       'what_does_action_look_like' method. I could send in a
#                       play, or a segment of a play and it will clean the play
#                       based off of what it looks like and return the cleaned
#                       action in a single row dataframe. Or could even just
#                       return the cleaning method itself.
#                 - When all segments are cleaned, the idea is to bring them
#                   all together in a single dataframe and replace that
#                   single play row within the main dataframe of plays.
#          6. I am sure there are more

#             SIDENOTE:
#             - The whole point of identifying 'start_spotting' and
#               'end_spotting' is to place parameters on a single segment of
#               a play.
#               - [start of player possession,
#                  everything in between,
#                  end of player possession]

#     WHERE THE REAL CURVEBALL COMES:
#     - Sometimes, yardage calculated by one player can have an effect on
#       the next. I do not want to brute force this, I feel like it would
#       be chunky and inefficient.
#       - Below is how yardage gained from one can affect another

#     2. Does the ball cross the line of scrimmage?

#        a. The ball remains behind the line of scrimmage:
#          -  I THINK:
#             The initial player who laterals the ball
#             - will be awarded yardage from the
#               line of scrimmage (start_spotting)
#               ->
#               spotting of the lateral (end_spotting).
#          - I THINK:
#            The player who received the lateral (?)
#            - POSSIBLE OUTCOMES
#              1. Receive 0 yardage
#                 line of scrimmage (start_spotting)
#                 OR
#                 spotting of reception (start_spotting) <- more likely I'm guessing
#                 ->
#                 some end spotting behind LOS (end_spotting)
#                 =
#                 0.0
#              2. OR they receive yardage
#                 line of scrimmage (start_spotting)
#                 OR
#                 spotting of reception (start_spotting) <- more likely I'm guessing
#                 ->
#                 some end spotting behind LOS (end_spotting)
#                 =
#                 Distance between (start_spotting -> end_spotting)
#
#        b. The ball is lateraled behind the line of scrimmage THEN
#           ball crosses line of scrimmage:
#          -  I THINK:
#             The initial player who laterals the ball
#             - will be awarded yardage from the
#               line of scrimmage (start_spotting)
#               ->
#               spotting of the lateral (end_spotting)
#               (UNTIL) ball goes over line of scrimmage in same play
#               = receives 0.0 yards
#          - I THINK:
#            The player who received the lateral
#             - will be awarded yardage from the
#               line of scrimmage (start_spotting)
#               ->
#               some end spotting behind LOS (end_spotting)

#        c. The ball is lateraled beyond the line of scrimmage THEN
#           remains beyond line of scrimmage:
#          -  I THINK:
#             The initial player who laterals the ball
#             - will be awarded yardage from the
#               line of scrimmage (start_spotting)
#               ->
#               spotting of the lateral (end_spotting).
#          - I THINK:
#            The player who received the lateral
#             - will be awarded yardage from the
#               spotting of reception (start_spotting) <- more likely I'm guessing
#               ->
#               some end spotting behind LOS (end_spotting)

# Goal right now is to display the play and organize the play just like
# it is during the fumble cleaning method.

# CURRENT GOAL:
# 1. display play
# 2. display split play (play split up by sentence)
# 3. display organized grouping of plays
#    - each grouping will become a single row within
#      the return dataframe that will, as a whole,
#      represent the entire play

def helper_clean_lateral_play(df_plays, play_index):

  # print(play_index)

  df_play = df_plays.loc[play_index].copy()

  play_description = df_play['PlayDescription']

  ############
  # REVERSES #
  ############
  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play_description.find('REVERSED') != -1:
    play_elements = split_play_description(play_description)
    for i in play_elements:
      if i.find("REVERSED") != -1:
        df_play['ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play_description = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  # Splitting play by sentences
  play_description_split = split_play_description(play_description)

  # # 1. display play
  # play_description_length = "- 1. PLAY DESCRIPTION -"
  # print("-" * len(play_description_length))
  # print(play_description_length)
  # print("-" * len(play_description_length))
  # print(play_description)
  # print()

  # # 2. display split play (play split up by sentence)
  # play_description_split_length = "- 2. PLAY DESCRIPTION SPLIT -"
  # print("-" * len(play_description_split_length))
  # print(play_description_split_length)
  # print("-" * len(play_description_split_length))
  # for i in play_description_split:
  #   print(i)
  # print()

  # 3. display organized grouping of plays

  # ORGANIZING GROUPING OF PLAYS:
  # - REMINDER:
  #   - A list will hold all groupings of individual actions (that being each
  #     player that has possession of the ball from start to finish). I want
  #     each grouping, or element in the list, to have all information needed
  #     to create an individual single dataframe row that will represent the
  #     action individually. These individual actions will eventually come
  #     together as a single dataframe and will represent the play entirely.
  # - PLAN TO ACHIEVE COLLECTING ALL INFORMATION FOR AN INDIVIDUAL ACTION:
  #   - Each element of the list of actions should look something like this,
  #     [start_spotting, everything in between, play description with end_spotting]
  #     - (start_spotting)
  #       - line of scrimmage
  #       - reception of lateral
  #     - (everything in between)
  #       - ? I do not know if I have come across anything that is in between
  #         yet. But I am not ruling it out.
  #     - (play description with end_spotting)
  #       - This will have information on the play itself and will be cleaned
  #         using a main cleaning method.

  #       - Eventually when I merge this idea with fumbles, sometimes actions
  #         that follow will affect the 'end_spotting' of an action.
  #         (e.g. fumble behind LOS, a player on same team eventually crosses
  #               LOS, so the initial player who fumbled is awarded 0 yards.)


  # - I am having trouble trying to figure out 'start_spotting' for certain
  #   plays, specifically ones that have to do with laterals or fumbles that
  #   occur behind the line of scrimmage.
  #   - If a player receives a lateral behind the line of scrimmage and
  #     their run ends behind the line of scrimmage, does he gain -
  #     a. yardage from the reception of the lateral to the end of the carry
  #     b. yardage from the LOS to the end of the carry
  #     c. 0 yardage
  #   - If a player receives a lateral behind the line of scrimmage and
  #     their run goes beyond the line of scrimmage, does he gain -
  #     a. yardage from the LOS to the end of the carry

  # THINKING ON PAGE:
  # - start_spotting for lateral behind LOS -> end carry beyond LOS
  #   - If the lateral spotting (start_spotting for the next player) is
  #     behind the LOS -> start_spotting will remain as the LOS.

  # - calculating yardage
  #   - Could I mess with the play description to calculate yardage?
  #     - In 'clean_run_plays' method, I can already input start_spotting
  #       and have the yardage calculated from there.
  #       - Would it be beneficial to manipulate the end_spotting in
  #         play descriptions so it calculates from start_spotting -> end_spotting?

  play_split = []

  start_spotting = df_play['FieldPosition']
  # print(start_spotting)

  # Cycling through every action (sentence) within play
  while (play_description_split):

    ###################
    # PRIMARY ACTIONS #
    ###################
    # - Primary actions are sentences that have details about a player rushing
    #   or passing the ball.

    # - Sentence that contains information of an attempt to gain yards
    # for play_pattern in [passer_name_pattern, standard_play_end_pattern]:
    for play_pattern in [standard_play_end_pattern]:
      standard_play = re.findall(play_pattern, play_description_split[0])
      if standard_play:
        # print(standard_play)
        # New grouping: [start_spotting, play description with end_spotting]
        play_split.append([start_spotting, play_description_split[0]])
        # The end spotting of this action is the start spotting of the next
        start_spotting = " ".join(standard_play[0][:2])
        break

    # - Sentence should be taken care of, onto the next sentence in the play
    play_description_split.pop(0)

  # play_description_grouped_length = "- 3. PLAYDESCRIPTION GROUPED -"
  # print("-" * len(play_description_grouped_length))
  # print(play_description_grouped_length)
  # print("-" * len(play_description_grouped_length))
  # for j in play_split:
  #   print(j)
  # print()


  return
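The grouping loop above chains spottings: each action starts where the previous one ended. A self-contained sketch of that idea follows. `END_SPOT` is a hypothetical stand-in for `standard_play_end_pattern` (which is defined elsewhere in the notebook and may differ), `group_lateral_segments` is an illustrative name, and the sentences are made-up examples rather than real play data.

```python
import re

# Hypothetical stand-in for 'standard_play_end_pattern'.
END_SPOT = re.compile(
    r"\bto ((?:[A-Z]{2,3} )?\d{1,2}) for (?:-?\d+ yards?|no gain)"
)


def group_lateral_segments(sentences, line_of_scrimmage):
    """Pair every action sentence with the spotting where that action starts."""
    segments = []
    start_spotting = line_of_scrimmage
    for sentence in sentences:
        match = END_SPOT.search(sentence)
        if match is None:
            # Not an attempt to gain yards (e.g. an injury note); skip it.
            continue
        segments.append([start_spotting, sentence])
        # The end spotting of this action is the start spotting of the next.
        start_spotting = match.group(1)
    return segments


example_sentences = [
    "A.Runner left end to KC 30 for 5 yards",
    "Lateral to B.Back to KC 38 for 8 yards (C.Tackler)",
    "D.Player was injured during the play",
]
for segment in group_lateral_segments(example_sentences, "KC 25"):
    print(segment)
```

Iterating over the list instead of popping from its front leaves the caller's list unchanged, which may make the helper easier to reuse once fumbles and laterals are merged.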

### OFFENSIVE CLEANING METHODS

#### PASS PLAYS

In [115]:
# PURPOSE:
# - Clean all passing play types
# INPUT PARAMETERS:
# df_plays    - dataframe - NFL plays
# index_start -  integer  - index in the dataframe of NFL plays where the method
#                           will start cleaning in ascending order.
# RETURN:
# df_plays - dataframe - the same input df_plays but with all passing play types cleaned

def clean_pass_plays(df_plays, index_start = None):

  # Adjusting df_plays to start cleaning at a specified index (index_start)
  if index_start is not None:
    # Locating all passing type plays (starting from 'index_start')
    df_plays_adjusted = df_plays.loc[index_start:]
    df_pass_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Pass')]
  else:
    # Locating all passing type plays (From entire input dataframe)
    df_pass_plays = df_plays[df_plays['PlayOutcome'].str.contains('Pass')]

  for idx, play in df_pass_plays['PlayDescription'].items():

    # print(idx)
    # print(play)

    ################
    # PLAY DETAILS #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Pass'

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before the "reversed" sentence is stored within "ReverseDetails"
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('REVERSED') != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############
    # LATERALS #
    ############
    # - Yardage gained from a lateral.. what would this look like?
    #   - Would the lateral method completely clean that play?
    #     - I don't think so.
    #       - I am trying to figure out whether it would be better to check
    #         for laterals or fumbles first.
    #         - I want to say that it wont matter because fumble and lateral
    #           cleaning methods are using base cleaning methods for the bulk
    #           of the cleaning. They are just organizing the sentences to be
    #           cleaned by the base cleaning methods and calculating yardage..?
    #         - What happens if there is a fumble to a lateral? or a lateral
    #           to a fumble? Will the ordering affect this and clean a play
    #           such as this inaccurately?

    if play.lower().find("lateral") != -1:
      helper_clean_lateral_play(df_plays, idx)
      continue

    ###########
    # FUMBLES #
    ###########
    # - Yardage gained from a fumble.. what would this look like?
    #   - Would the fumble method completely clean that play?
    #     - I think so.
    if play.lower().find("fumble") != -1:
      helper_clean_fumble_play(df_plays, idx)
      continue

    ###########
    # OFFENSE #
    ###########


    # These may have to change in the future
    # - I do not think that the value with the 'end_spotting' will always
    #   be 'play'. I think that in the future, I will need to get more precise
    #   with this.
    #   - I think that the end spotting will actually always be in the description
    #     somewhere. I will have to locate it
    # - I do not think that 'start_spotting' will always be the field position.
    start_spotting = df_plays.loc[idx, 'FieldPosition']
    description_with_end_spotting = play
    # df_plays.loc[idx, 'Yardage'] = yardage_between_spottings(df_plays, idx, start_spotting, description_with_end_spotting)
    df_plays.loc[idx, 'Yardage'] = yardage_between_spottings(df_plays.loc[idx, 'TeamWithPossession'], start_spotting, description_with_end_spotting)

    # I am not giving up on this option of receiving play yardage
    # VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
    # - I think it is the fastest and is accurate for normal plays.

    # action_yardage = re.findall(standard_play_end_pattern, play)
    # if action_yardage:
    #   print(action_yardage)
    #   # End Spot
    #   df_plays.loc[idx, 'EndSpot'] = " ".join(action_yardage[0][:2])
    #   # Yardage
    #   if action_yardage[0][2] == 'no gain':
    #     df_plays.loc[idx, 'Yardage'] = 0
    #   else:
    #     df_plays.loc[idx, 'Yardage'] = action_yardage[0][2]
    # else:
    #   print("No action yardage")

    # Passer
    passer_name = re.findall(passer_name_pattern, play)
    if passer_name:
      # print(passer_name)
      df_plays.loc[idx, 'Passer'] = passer_name[0]

    # Receiver name and passing details
    receiver_name_and_passing_details = re.findall(receiver_pattern, play)
    if receiver_name_and_passing_details:
      # print(receiver_name_and_passing_details)
      df_plays.loc[idx, 'Direction'] = " ".join(receiver_name_and_passing_details[0][:2])
      df_plays.loc[idx, 'Receiver'] = receiver_name_and_passing_details[0][2]

    # Unique situation (offense spikes the ball)
    if play.find('spike') != -1:
      df_plays.loc[idx, 'Direction'] = 'spiked' # Direction?

    #############
    #  DEFENSE  #
    #############

    solo_tackle = re.findall(solo_tackle_pattern, play)
    if solo_tackle:
      if df_plays.loc[idx, 'PlayDescription'].find('pass incomplete') != -1:
        df_plays.loc[idx, 'PassDefendedBy'] = solo_tackle[0]
      else:
        df_plays.loc[idx, 'SoloTackle'] = solo_tackle[0]

    shared_tackle = re.findall(shared_tackle_pattern, play)
    if len(shared_tackle) > 0:
      if df_plays.loc[idx, 'PlayDescription'].find('pass incomplete') != -1:
        df_plays.at[idx, 'PassDefendedBy'] = shared_tackle[0]
      else:
        df_plays.at[idx, 'SharedTackle'] = shared_tackle[0]

    assisted_tackle = re.findall(assisted_tackle_pattern, play)
    if len(assisted_tackle) > 0:
      df_plays.at[idx, 'AssistedTackle'] = assisted_tackle[0]

    pressure_by = re.findall(defense_pressure_name_pattern, play)
    if len(pressure_by) > 0:
      df_plays.loc[idx, 'PressureBy'] = pressure_by[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury_pattern, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    # print()

  # Return the updated dataframe (also when no pass plays were found)
  return df_plays
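The "REVERSES" step appears in the same form in the pass, run, fumble, and lateral methods, so it could be shared. The sketch below shows one way to do that; `split_off_reversal` is a hypothetical helper name, the example description is made up, and splitting on `". "` is an assumption that would break on names containing a period followed by a space.

```python
def split_off_reversal(play_description):
    """Split a description into (reverse_details, remaining_description).

    Everything up to and including the sentence containing 'REVERSED' is
    returned as a list; the remaining sentences are rejoined into one string.
    """
    sentences = play_description.split(". ")
    for i, sentence in enumerate(sentences):
        if "REVERSED" in sentence:
            return sentences[:i + 1], ". ".join(sentences[i + 1:])
    # No reversal: nothing to store, description is unchanged.
    return [], play_description


example_play = (
    "Q.Back pass short left to W.Out to KC 40 for 10 yards. "
    "The Replay Official reviewed the ruling, and the play was REVERSED. "
    "Q.Back pass incomplete short left to W.Out"
)
reverse_details, remaining = split_off_reversal(example_play)
print(reverse_details)
print(remaining)
```

Each cleaning method could then store the first value in `'ReverseDetails'` and continue cleaning with the second.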

#### RUN PLAYS

In [116]:
# PURPOSE:
# - Clean run play types
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful run play
#                        data accessible and clean.

# NOTE:
# - This method will be used for all actions that involve running with the football.
#   (e.g. fumble recoveries for yardage, fumble recoveries for touchdown, laterals,
#         kickoff returns, punt returns, etc..)

# def clean_run_plays(df_plays, index_start = None):
def clean_run_plays(df_plays, start_spotting = None, index_start = None):

  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_run_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Run')]
  else:
    df_run_plays = df_plays[df_plays['PlayOutcome'].str.contains('Run')]

  # Iterating through every run play within 'df_run_plays'
  for idx, play in df_run_plays['PlayDescription'].items():

    # print(idx)
    # print(play)

    ################
    # Play details #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Run'

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############
    # LATERALS #
    ############
    # - Yardage gained from a lateral.. what would this look like?
    #   - Would the lateral method completely clean that play?
    #     - I think so.

    ###########
    # FUMBLES #
    ###########
    # - Yardage gained from a fumble.. what would this look like?
    #   - Would the fumble method completely clean that play?
    #     - I think so.
    # if play.lower().find("fumble") != -1:
    if play.lower().find("fumble") != -1 or play.find("MUFFS") != -1:
      helper_clean_fumble_play(df_plays, idx)
      if df_run_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      # continue

    #############
    #  OFFENSE  #
    #############

    # Rusher
    rusher_patterns = [rusher_pattern, defensive_takeaway_run_pattern, defensive_takeaway_for_touchdown]
    # Loop through patterns and find the first match
    for pattern in rusher_patterns:
      rusher_name = re.findall(pattern, play)
      if rusher_name:
        # Regular run play
        if pattern == rusher_pattern:
          # Rusher
          df_plays.loc[idx, 'Rusher'] = rusher_name[0][0]
          # Direction
          df_plays.at[idx, 'Direction'] = " ".join(rusher_name[0][1::]).strip()
        # Defensive takeaway (interception or fumble)
        # - Because punt and kickoff returns follow the same format as
        #   interception or fumble returns for yardage, this will also grab
        #   the 'Returner' name for them.
        else:
          df_plays.loc[idx, 'Rusher'] = rusher_name[0][0]
        break

    # if not rusher_name:
    #   raise ValueError(f"rusher not found at {idx}, \"{play}\"")

    # These may have to change in the future
    # - I do not think that the value with the 'end_spotting' will always
    #   be 'play'. I think that in the future, I will need to get more precise
    #   with this.
    # - I do not think that 'start_spotting' will always be the field position.
    # start_spotting = df_plays.loc[idx, 'FieldPosition']

    if start_spotting is None:
      start_spotting = df_plays.loc[idx, 'FieldPosition']
    description_with_end_spotting = play
    # df_plays.loc[idx, 'Yardage'] = yardage_between_spottings(df_plays, idx, start_spotting, description_with_end_spotting)
    df_plays.loc[idx, 'Yardage'] = yardage_between_spottings(df_plays.loc[idx, 'TeamWithPossession'], start_spotting, description_with_end_spotting)
    # Need to reset for next play
    start_spotting = None

    # YARDAGE FOR HANDOFFS? #
    # - That's what was here in the older version. Need to keep an eye out.

    #############
    #  DEFENSE  #
    #############

    solo_tackle = re.findall(solo_tackle_pattern, play)
    if solo_tackle:
        df_plays.loc[idx, 'SoloTackle'] = solo_tackle[0]

    shared_tackle = re.findall(shared_tackle_pattern, play)
    if len(shared_tackle) > 0:
        df_plays.at[idx, 'SharedTackle'] = shared_tackle[0]

    assisted_tackle = re.findall(assisted_tackle_pattern, play)
    if len(assisted_tackle) > 0:
      df_plays.at[idx, 'AssistedTackle'] = assisted_tackle[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury_pattern, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # print()

    # Return if the last play has been cleaned in 'df_run_plays'
    if df_run_plays.tail(1).index.tolist()[0] == idx:
      return df_plays
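
The 'start_spotting' / 'end_spotting' notes in the cell above can be illustrated with a minimal sketch. This is not the notebook's `yardage_between_spottings` (which is defined elsewhere and takes the play description as its third argument). The helper names below are hypothetical, and the sketch assumes spots look like `KC 25` or `50` and that the team abbreviation in the spot matches `TeamWithPossession`.

```python
def spot_to_yardline(team_with_possession, spot):
    """Convert a spot such as 'KC 25' to yards from the possessing team's own goal line."""
    spot = spot.strip()
    if spot == "50":
        return 50
    side, yard = spot.split()
    yard = int(yard)
    # Own side of the field counts up from 0; the opponent's side is mirrored around midfield.
    return yard if side == team_with_possession else 100 - yard


def yardage_between(team_with_possession, start_spot, end_spot):
    """Signed yards gained (positive) or lost (negative) between two spots."""
    start = spot_to_yardline(team_with_possession, start_spot)
    end = spot_to_yardline(team_with_possession, end_spot)
    return end - start


print(yardage_between("KC", "KC 25", "KC 32"))    # 7
print(yardage_between("KC", "KC 45", "JAX 48"))   # 7  (45 -> 52)
print(yardage_between("KC", "JAX 30", "JAX 36"))  # -6 (70 -> 64)
```

Passing the returning team as the first argument is what lets the same arithmetic cover interception, punt, and kickoff returns, which is presumably why possession is flipped before `clean_run_plays` is called in the later cells.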

#### 2PT CONVERSIONS

In [117]:
# I NEED A LARGER SAMPLE SIZE FOR MORE PLAYS
# - I need a sample size that has fumbled plays (if that's possible?)
# - I need a sample size that has interception (if that's possible?)
# - I need a sample size with injuries (as dark as that may sound)

def clean_2pt_conversion_plays(df_plays, index_start = None):

  # Cut 'df_plays' to begin from 'index_start' to the last '2pt conversion' play available in dataframe
  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_2pt_conversion_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Conversion', case=False)]
  else:
    df_2pt_conversion_plays = df_plays[df_plays['PlayOutcome'].str.contains('Conversion', case=False)]

  # Iterating through every 2pt conversion play within 'df_2pt_conversion_plays'
  for idx, play in df_2pt_conversion_plays['PlayDescription'].items():

    # print(idx)
    # print(play)

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before the "reversed" sentence is stored within "ReverseDetails"
    if play.find('REVERSED') != -1:
      play_elements = split_play_description(play)
      for i in play_elements:
        if i.find('REVERSED') != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ###################
    # PASSING ATTEMPT #
    ###################

    pass_2ptc = re.findall(tp_conversion_pass_pattern, play)
    if pass_2ptc:
      # print(pass_2ptc)
      df_plays.loc[idx, 'Passer'] = pass_2ptc[0][0]
      df_plays.loc[idx, 'Receiver'] = pass_2ptc[0][1]
      df_plays.loc[idx, 'PlayType'] = '2PT Conversion Pass'

    ###################
    # RUSHING ATTEMPT #
    ###################

    rush_2ptc = re.findall(tp_conversion_rush_pattern, play)
    if rush_2ptc:
      # print(rush_2ptc)
      df_plays.loc[idx, 'Rusher'] = rush_2ptc[0][0]
      # " ".join(rusher_name[0][1::]).strip()
      # df_plays.loc[idx, 'Direction'] = rush_2ptc[0][1]
      df_plays.loc[idx, 'Direction'] = " ".join(rush_2ptc[0][1::]).strip()
      df_plays.loc[idx, 'PlayType'] = '2PT Conversion Rush'

  return  df_plays
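
The REVERSES block is repeated in almost every cleaning method. As a rough sketch only, it could be pulled into one helper. The name `split_off_reversal` is hypothetical, and it assumes the play has already been split into a list of sentences (for example by `split_play_description`).

```python
def split_off_reversal(sentences):
    """Return (reverse_details, remaining) for a list of play sentences.

    Everything up to and including the first sentence containing 'REVERSED'
    goes to reverse_details; the rest is what still needs to be cleaned.
    """
    for position, sentence in enumerate(sentences):
        if "REVERSED" in sentence:
            return sentences[:position + 1], sentences[position + 1:]
    # No reversal found: nothing to store, everything still needs cleaning.
    return [], sentences


# Made-up sentences, only to show the shape of the result.
sentences = [
    "P.Example pass complete to R.Sample for 2 yards",
    "The Replay Official reviewed the ruling, and the play was REVERSED",
    "P.Example pass incomplete",
]
details, remaining = split_off_reversal(sentences)
print(len(details))          # 2
print(". ".join(remaining))  # P.Example pass incomplete
```

Using `enumerate` instead of `play_elements.index(i)` avoids a second scan of the list and stays correct if two sentences happen to be identical.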

### DEFENSE CLEANING METHODS

#### INTERCEPTIONS

In [161]:
# PURPOSE:
# - Clean intercepted plays
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful intercepted play
#                        data accessible and clean.

# ROUGH DESIGN
# 1. Narrow dataframe using 'index_start'
#    - This is a recursive method, the narrowing will get smaller and
#      smaller until all 'intercepted' type plays have been cleaned.
# 2. Grab first 'intercepted' play from narrowed dataframe
# 3. Create 2 single row dataframes.
#    a. intended play
#    b. yardage after interception
# 4. Break down play into sentences and clean
#    - Depending on the sentence within the play, will determine which
#      single row dataframe it will go to.
# 5. Combine both dataframes of cleaned data into one dataframe
# 6. Replace old play row with new cleaned multi row
# 7. return clean_intercepted_plays( x , y)
#    - x = updated df_plays
#    - y = index directly after the last clean added row

def clean_intercepted_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_intercepted_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Interception')]
  else:
    df_intercepted_plays = df_plays[df_plays['PlayOutcome'].str.contains('Interception')]

  # Exit case (If no more 'Interception' type plays are found)
  if df_intercepted_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first intercepted play in 'df_intercepted_plays'
  # - Process one play per iteration in the recursive method
  idx = df_intercepted_plays.index[0]
  play = df_plays['PlayDescription'].loc[idx]

  #############
  # VARIABLES #
  #############

  interception_spotting = None

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before the "reversed" sentence is stored within "ReverseDetails"
  if play.find('REVERSED') != -1:
    play_elements = split_play_description(play)
    for i in play_elements:
      if i.find('REVERSED') != -1:
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  ###########
  # FUMBLES #
  ###########
  # - I am worried about interception/fumble combinations that I have yet to see,
  #   such as the QB fumbling and then throwing an interception.

  # Create 2 single row dataframes.
  # 1. intended play
  df_intended_play = df_plays.loc[idx].copy()
  df_intended_play = pd.DataFrame([df_intended_play], columns=df_plays.columns)
  df_intended_play.reset_index(drop=True, inplace=True)
  df_intended_play['PlayDescription'] = 'nan'
  # 2. yardage after interception
  df_yardage_after_interception = df_plays.loc[idx].copy()
  df_yardage_after_interception = pd.DataFrame([df_yardage_after_interception], columns=df_plays.columns)
  df_yardage_after_interception.reset_index(drop=True, inplace=True)
  df_yardage_after_interception['PlayDescription'] = 'nan'

  # break down play by sentences.
  play_elements = split_play_description(play)

  # Split play elements
  # 1. intended play
  #    - Grab all elements leading up to the sentence containing interception
  #      - Clean using 'clean_pass_plays' method
  # 2. actions after interception
  #    - Grab all elements after sentence containing interception
  #      - Clean using 'clean_run_plays' method
  #      - Clean using 'clean_touchdown_plays' method..?

  # Separating play into
  # 1. intended passing play
  # 2. remaining actions following interception
  for i in play_elements:
    if i.lower().find('intercepted') != -1:
      intended_play_playdescription = ". ".join(play_elements[:play_elements.index(i)+1])
      after_interception_playdescription = ". ".join(play_elements[play_elements.index(i)+1:])
      # print(idx)
      # print(intended_play_playdescription)
      # print(after_interception_playdescription)
      break

  #################
  # INTENDED PLAY #
  #################

  df_intended_play['PlayDescription'] = intended_play_playdescription
  df_intended_play['PlayOutcome'] = 'Pass'
  df_intended_play = clean_pass_plays(df_intended_play)
  df_intended_play['PlayOutcome'] =  df_plays['PlayOutcome'].loc[idx]

  # Intercepted by
  intercepted_by = re.findall(interception_name_pattern, intended_play_playdescription)
  if intercepted_by:
    df_intended_play['InterceptedBy'] = intercepted_by[0][0]
    interception_spotting = intercepted_by[0][1]
    # print(interception_spotting)
    # - During intercepted plays, the intended-play portion of the play description is cleaned
    #   by the regular pass cleaning method. A defensive player credited with a pass defended
    #   during an intercepted play is formatted exactly like a player credited with a solo
    #   tackle during a completed pass play. I leverage that here and move the player
    #   to the correct feature ('SoloTackle' -> 'PassDefendedBy').
    if df_intended_play['SoloTackle'].iloc[0] != 'nan':
      df_intended_play.at[0, 'PassDefendedBy'] = (intercepted_by[0][0], df_intended_play['SoloTackle'].iloc[0])
      df_intended_play['SoloTackle'] = 'nan'
    else:
      df_intended_play['PassDefendedBy'] = intercepted_by[0][0]





  #############################################################
  # YARDAGE AFTER INTERCEPTION / TOUCHDOWN AFTER INTERCEPTION #
  #############################################################
  # - I need this to be able to clean everything.
  #   - I need it to be able to clean regular interceptions for yardage (X)
  #   - I need it to be able to clean regular interceptions for yardage and then fumbled (X)
  #   - I need it to be able to clean interceptions resulting in multiple fumbles (X)
  #   - I need it to be able to clean interceptions for touchdowns (X)

  #   - I need it to be able to clean a fumbled interception that is recovered for a touchdown
  #   - I need this to account for penalties

  # for action in [standard_play_end_pattern]:
  for action in [standard_play_end_pattern, touchdown_pattern]:
    yardage_after_interception = re.findall(action, after_interception_playdescription)
    if yardage_after_interception:
      df_yardage_after_interception['PlayDescription'] = after_interception_playdescription

      # Flipping team with possession when the play transitions from one team with possession to the other.
      if df_yardage_after_interception['TeamWithPossession'].iloc[0] == df_yardage_after_interception['HomeTeam'].iloc[0]:
        df_yardage_after_interception['TeamWithPossession'] = df_yardage_after_interception['AwayTeam'].iloc[0]
      else:
        df_yardage_after_interception['TeamWithPossession'] = df_yardage_after_interception['HomeTeam'].iloc[0]

      # - For yardage gained on this play, I would like to send this job to the
      #   cleaning method for run plays.
      #   - I will need to adjust 3 methods to accomplish this:
      #     1. this method
      #        - I need to add another regular expression
      #          "defensive_takeaway_run_pattern"
      #     2. run method
      #        1. I will need to add another parameter for 'start spotting'
      #           - I can grab the start spotting from the end of the sentence
      #             containing the intercepted information.
      #     3. yardage between spottings
      #        - I might have to adjust this method for touchdowns in the
      #          future. Right now I think it is capable of doing the trick,
      #          but not for touchdowns.

      # # Ideally I would like to send this off to another method.
      # if action == touchdown_after_takeaway_pattern:
      #   df_yardage_after_interception['IsScoringPlay'] = 1
      # # else:
      df_yardage_after_interception['PlayOutcome'] = 'Run'
      df_yardage_after_interception = clean_run_plays(df_yardage_after_interception, interception_spotting)
      df_yardage_after_interception['PlayOutcome'] =  df_plays['PlayOutcome'].loc[idx]

      # The 'clean_run_plays' method will change 'PlayType' so that is why I am
      # putting it down here.
      df_yardage_after_interception['PlayType'] = 'Run After Interception'
      break





  #############################
  # NEW REPLACEMENT DATAFRAME #
  #############################

  # combine both single row dataframes into one
  if df_yardage_after_interception['PlayDescription'].iloc[0] == 'nan':
    df_cleaned_replacement = df_intended_play
  else:
    df_cleaned_replacement = pd.concat([df_intended_play, df_yardage_after_interception], ignore_index=True)

  # Replace old row with new cleaned dataframe
  df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
  df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
  df_plays = pd.concat([df_before_row, df_cleaned_replacement, df_after_row], ignore_index=True)

  # print()
  # If this is the last play in the dataset
  if df_intercepted_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_intercepted_plays(df_plays, idx+len(df_cleaned_replacement))
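
Steps 5 to 7 of the rough design (swap one row for one or more cleaned rows, then continue from the row right after them) can be sketched in isolation. This is a simplified illustration, not the method above. `replace_row_with_rows` is a hypothetical name, and the sketch assumes `df` carries a default RangeIndex so that labels and positions agree.

```python
import pandas as pd


def replace_row_with_rows(df, idx, replacement):
    """Swap the row labelled idx for the rows of replacement and renumber the index.

    Returns the new frame and the index of the first row after the inserted block.
    """
    position = df.index.get_loc(idx)
    before = df.iloc[:position]
    after = df.iloc[position + 1:]
    combined = pd.concat([before, replacement, after], ignore_index=True)
    return combined, position + len(replacement)


plays = pd.DataFrame({"PlayType": ["Run", "Interception", "Pass"]})
split_rows = pd.DataFrame({"PlayType": ["Pass", "Run After Interception"]})

plays, next_index = replace_row_with_rows(plays, 1, split_rows)
print(plays["PlayType"].tolist())  # ['Run', 'Pass', 'Run After Interception', 'Pass']
print(next_index)                  # 3
```

Because `ignore_index=True` renumbers every row, the next search has to start at the position after the inserted block, which is what `idx+len(df_cleaned_replacement)` does in the recursive call.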

#### SACKS

In [119]:
# PURPOSE:
# - Clean sacked plays
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful sacked play
#                        data accessible and clean.

def clean_sacked_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_sacked_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Sack')]
  else:
    df_sacked_plays = df_plays[df_plays['PlayOutcome'].str.contains('Sack')]

  for idx, play in df_sacked_plays['PlayDescription'].items():
    # print(idx)
    # print(play)

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before the "reversed" sentence is stored within "ReverseDetails"
    if play.find('REVERSED') != -1:
      play_elements = split_play_description(play)
      for i in play_elements:
        if i.find('REVERSED') != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ###########
    # FUMBLES #
    ###########
    # - Yardage gained from a fumble.. what would this look like?
    #   - Would the fumble method completely clean that play?
    #     - I think so.
    if play.lower().find("fumble") != -1:
      helper_clean_fumble_play(df_plays, idx)
      if df_sacked_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      # continue


    ###########
    # OFFENSE #
    ###########

    # Sacked Passer
    sacked_passer_name = re.findall(passer_name_pattern, play)
    if sacked_passer_name:
      df_plays.loc[idx, 'Passer'] = sacked_passer_name[0]

    # Sacked Yardage lost
    df_plays.loc[idx, 'Yardage'] = yardage_between_spottings(df_plays.loc[idx, 'TeamWithPossession'],
                                                             df_plays.loc[idx, 'FieldPosition'],
                                                             play)

    ###########
    # DEFENSE #
    ###########

    # Solo sack (One person sacked the passer)
    solo_sack = re.findall(solo_tackle_pattern, play)
    if solo_sack:
      df_plays.loc[idx, 'SackedBy'] = solo_sack[0]
      df_plays.loc[idx, 'SoloTackle'] = solo_sack[0]

    # Split sack (A sack was given to the passer by multiple defenders)
    split_sack = re.findall(split_sack_pattern, play)
    if split_sack:
      df_plays.at[idx, 'SackedBy'] = split_sack[0]
      df_plays.at[idx, 'AssistedTackle'] = split_sack[0]

    ##############
    #  INJURIES  #
    ##############

    #############
    #  PENALTY  #
    #############

  # Return after the loop so an empty 'df_sacked_plays' still returns 'df_plays'
  return df_plays

### SPECIAL TEAMS CLEANING METHODS

#### PUNTS

In [120]:
# PURPOSE:
# - Clean all punt play types

# A punt playtype will be split into 2 or more rows
#   1. The Punt
#      - 'PlayType'
#         - Punt
#      - 'Punter'
#      - 'LongSnapper'
#   2. The Punt Return
#      - 'PlayType'
#         - Punt Return
#      - 'PlayOutcome'
#         - x yard punt return
#         - fair catch
#         - touchback
#         - out of bounds
#         - downed
#      - 'Returner'
#      - 'Receiver'
#      - 'Yardage'
#      - 'TackleBy1'
#      - 'TackleBy2'
#      - 'DownedBy'

def clean_punt_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_punt_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Punt')]
  else:
    df_punt_plays = df_plays[df_plays['PlayOutcome'].str.contains('Punt')]

  if df_punt_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first punt play in 'df_punt_plays'
  # - Process one play per iteration in the recursive method
  idx = df_punt_plays.index[0]
  play = df_plays['PlayDescription'].loc[idx]

  #############
  # VARIABLES #
  #############

  punt_catch_spotting = None

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play.find('REVERSED') != -1:
    play_elements = split_play_description(play)
    for i in play_elements:
      if i.find("REVERSED") != -1:
        # df_play['ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  ###########
  # FUMBLES #
  ###########
  # - I have yet to see a fumble during a punt. I know that it is possible
  #   and will have to update this method when that time comes.
  # - Fumble returns will be taken care of using 'clean_run_plays'

  # Create 2 single row dataframes.
  # 1. The Punt
  # df_punt = df_play
  df_punt = df_plays.loc[idx].copy()
  df_punt = pd.DataFrame([df_punt], columns=df_plays.columns)
  df_punt.reset_index(drop=True, inplace=True)
  df_punt['PlayDescription'] = 'nan'
  # 2. The Punt Return
  # df_punt_return = df_play
  df_punt_return = df_plays.loc[idx].copy()
  df_punt_return = pd.DataFrame([df_punt_return], columns=df_plays.columns)
  df_punt_return.reset_index(drop=True, inplace=True)
  df_punt_return['PlayDescription'] = 'nan'

  # break down play by sentences.
  play_elements = split_play_description(play)

  # Split play elements
  # 1. punt
  #    - Grab all elements up to the sentence containing punt
  # 2. actions after punt
  #    - Grab all elements after sentence containing punt
  #      - Clean using 'clean_run_plays' method
  #      - Clean using 'clean_touchdown_plays' method..?

  # Separating play into
  # 1. punt play
  # 2. remaining actions following punt
  for i in play_elements:
    if i.lower().find('punts') != -1:
      punt_play_playdescription = ". ".join(play_elements[:play_elements.index(i)+1])
      punt_return_playdescription = ". ".join(play_elements[play_elements.index(i)+1:])
      # print(idx)
      # print(punt_play_playdescription)
      # print(punt_return_playdescription)
      # print()
      break

  ########
  # PUNT #
  ########

  # All data needed for first row in replacement dataframe
  df_punt['PlayDescription'] = punt_play_playdescription
  df_punt['PlayOutcome'] = 'Punt'
  punt = re.findall(punting_pattern, punt_play_playdescription)
  if punt:
    punt_catch_spotting = punt[0][2]
    df_punt['PlayType'] = 'Punt'
    df_punt['PlayDescription'] = i
    df_punt['Kicker'] = punt[0][0]
    df_punt['Yardage'] = int(punt[0][1])
    df_punt['LongSnapper'] = punt[0][3]
    # Touchback
    if i.find('Touchback') != -1:
      df_punt['PlayOutcome'] = 'Touchback'
    # Out of bounds
    if i.find('out of bounds') != -1:
      df_punt['PlayOutcome'] = 'out of bounds'
    # Downed by
    if i.find('downed by') != -1:
      df_punt['PlayOutcome'] = 'downed'
      downed_by = re.findall(kick_downed_by_pattern, i)
      if downed_by:
        df_punt['DownedBy'] = downed_by[0]
    # fair catch
    if i.find('fair catch') != -1:
      df_punt['PlayOutcome'] = 'fair catch'
      fair_catch_by = re.findall(punt_fair_catch_pattern, i)
      if fair_catch_by:
        df_punt['Returner'] = fair_catch_by[0]

  ######################################
  # PUNT RETURN (Including touchdowns) #
  ######################################

  # All data needed for the second row within replacement dataframe
  # - Second row only needed when there is a punt return for yardage
  # - I think I am going to run into trouble if there is a fumble recovery for yardage
  punt_return_patterns = [standard_play_end_pattern, touchdown_pattern, muffed_catch_pattern]
  for return_pattern in punt_return_patterns:
    punt_return = re.findall(return_pattern, punt_return_playdescription)
    if punt_return:
      df_punt_return['PlayDescription'] = punt_return_playdescription
      df_punt_return['PlayOutcome'] = 'Run'
      # Change team with possession on punt returns to the team that is returning the ball
      if df_punt['TeamWithPossession'].iloc[0] == df_punt['HomeTeam'].iloc[0]:
        df_punt_return.loc[0, 'TeamWithPossession'] = df_punt['AwayTeam'].iloc[0]
      else:
        df_punt_return.loc[0, 'TeamWithPossession'] = df_punt['HomeTeam'].iloc[0]
      df_punt_return = clean_run_plays(df_punt_return, punt_catch_spotting)
      df_punt_return['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]
      df_punt_return['PlayType'] = 'Punt Return'
      df_punt_return['Returner'] = df_punt_return['Rusher']
      df_punt_return['Rusher'] = 'nan'
      break

  #############
  #  PENALTY  #
  #############





  #############################
  # NEW REPLACEMENT DATAFRAME #
  #############################

  if df_punt_return['PlayDescription'].iloc[0] == 'nan':
    df_replacement_rows = df_punt
  elif df_punt['PlayDescription'].iloc[0] == 'nan': # Will happen during fumbled punt returns.
    df_replacement_rows = df_punt_return
  else:
    df_replacement_rows = pd.concat([df_punt, df_punt_return], ignore_index=True)

  df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
  df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
  df_plays = pd.concat([df_before_row, df_replacement_rows, df_after_row], ignore_index=True)

  if df_punt_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_punt_plays(df_plays, idx+len(df_replacement_rows))
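
The punt cell reads four groups out of `punting_pattern` (punter, yards, spot where the ball lands, long snapper). The real pattern is defined elsewhere in the notebook and may differ. The regular expression below is a hypothetical stand-in written only for illustration, and the sentence and player names are made up.

```python
import re

# Hypothetical pattern: NOT the notebook's 'punting_pattern'.
# Groups: (punter, yards, landing spot, long snapper)
example_punting_pattern = re.compile(
    r"([A-Z][\w.'-]+) punts (\d+) yards? to "
    r"([A-Z]{2,3} \d{1,2}|50|end zone), Center-([A-Z][\w.'-]+)"
)

sentence = "B.Punter punts 52 yards to LAC 20, Center-J.Snapper, fair catch by D.Returner."

punt = re.findall(example_punting_pattern, sentence)
if punt:
    punter, yards, landing_spot, long_snapper = punt[0]
    print(punter)        # B.Punter
    print(int(yards))    # 52
    print(landing_spot)  # LAC 20
    print(long_snapper)  # J.Snapper
```

With several capture groups, `re.findall` returns a list of tuples, which is why the cell indexes `punt[0][0]` through `punt[0][3]`. The landing spot is the value handed to `clean_run_plays` as the starting spot for the return.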

#### KICKOFF

In [121]:
# A kickoff playtype will be split into 1 or more rows

# I need to figure out an onside kick (recovered by kicking team)
# I need to figure out fumbled kickoff returns
# I need to figure out returns for a touchdown
# injuries?

# Method can mirror punts method.

def clean_kickoff_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_kickoff_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Kickoff', case=False)]
  else:
    df_kickoff_plays = df_plays[df_plays['PlayOutcome'].str.contains('Kickoff', case=False)]

  # exit case
  if df_kickoff_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first kickoff play in 'df_kickoff_plays'
  # - Process one play per iteration in the recursive method
  idx = df_kickoff_plays.index[0]
  play = df_plays['PlayDescription'].loc[idx]

  #############
  # VARIABLES #
  #############

  kickoff_catch_spotting = None

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play.find('REVERSED') != -1:
    play_elements = split_play_description(play)
    for i in play_elements:
      if i.find("REVERSED") != -1:
        # df_play['ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  ###########
  # FUMBLES #
  ###########
  # - Will be taken care of using 'clean_run_plays'

  # Create 2 single row dataframes.
  # 1. The Kickoff
  df_kickoff = df_plays.loc[idx].copy()
  df_kickoff = pd.DataFrame([df_kickoff], columns=df_plays.columns)
  df_kickoff.reset_index(drop=True, inplace=True)
  df_kickoff['PlayDescription'] = 'nan'
  # 2. The Kickoff Return
  df_kickoff_return = df_plays.loc[idx].copy()
  df_kickoff_return = pd.DataFrame([df_kickoff_return], columns=df_plays.columns)
  df_kickoff_return.reset_index(drop=True, inplace=True)
  df_kickoff_return['PlayDescription'] = 'nan'

  # break down play by sentences.
  play_elements = split_play_description(play)

  # Split play elements
  # 1. kickoff
  #    - Grab all elements up to the sentence containing kickoff
  # 2. actions after kickoff
  #    - Grab all elements after sentence containing kickoff
  #      - Clean using 'clean_run_plays' method
  #      - Clean using 'clean_touchdown_plays' method..?

  # Separating play into
  # 1. kickoff play
  # 2. remaining actions following kickoff
  for i in play_elements:
    if i.lower().find('kicks') != -1:
      kickoff_play_playdescription = ". ".join(play_elements[:play_elements.index(i)+1])
      kickoff_return_playdescription = ". ".join(play_elements[play_elements.index(i)+1:])
      if kickoff_return_playdescription.find("(didn't try to advance)") != -1:
        kickoff_return_playdescription = kickoff_return_playdescription.replace("(didn't try to advance) ", "")
      # print(idx)
      # print(kickoff_play_playdescription)
      # if len(kickoff_return_playdescription) > 0:
      #   print(idx)
      #   print(kickoff_play_playdescription)
      #   print(kickoff_return_playdescription)
      #   print()
      break

  ###########
  # KICKOFF #
  ###########

  # All data needed for first row in replacement dataframe
  df_kickoff['PlayDescription'] = kickoff_play_playdescription
  df_kickoff['PlayOutcome'] = 'Kickoff'
  kickoff = re.findall(kickoff_pattern, kickoff_play_playdescription)
  if kickoff:
    # print(kickoff)
    kickoff_catch_spotting = kickoff[0][3]
    # Change team with possession on kickoff to the team that is kicking
    if df_kickoff['TeamWithPossession'].iloc[0] == df_kickoff['HomeTeam'].iloc[0]:
      df_kickoff_return.loc[0, 'TeamWithPossession'] = df_kickoff['AwayTeam'].iloc[0]
    else:
      df_kickoff_return.loc[0, 'TeamWithPossession'] = df_kickoff['HomeTeam'].iloc[0]
    df_kickoff['Kicker'] = kickoff[0][0]
    # df_kickoff['Yardage'] = int(kickoff[0][1]) <--- use helper method
    if kickoff_play_playdescription.find('Touchback') != -1:
      df_kickoff['PlayOutcome'] = 'Touchback'
    # I need to figure out what the difference will be when the kicking team recovers
    if kickoff_play_playdescription.find('onside') != -1:
      df_kickoff['PlayOutcome'] = 'onside'
      downed_by = re.findall(kick_downed_by_pattern, i)
      if downed_by:
        df_kickoff['DownedBy'] = downed_by[0]

  #########################################
  # KICKOFF RETURN (Including touchdowns) #
  #########################################



  # 1. might have to create new regular expression for kickoff return patterns
  # 2. might have to adjust yardage between spottings method to accommodate
  #    kickoff yardage
  #    - Does yardage between spotting handle 'end zone' as an input..?



  # All data needed for the second row within replacement dataframe
  # - Second row only needed when there is a kickoff return for yardage
  kickoff_return_patterns = [standard_play_end_pattern]
  for return_pattern in kickoff_return_patterns:
    kickoff_return = re.findall(return_pattern, kickoff_return_playdescription)
    if kickoff_return:
      # print(kickoff_return)
      df_kickoff_return['PlayDescription'] = kickoff_return_playdescription
      df_kickoff_return['PlayOutcome'] = 'Run'
      # Change team with possession on kickoff returns to the team that is
      # returning the ball.
      if df_kickoff['TeamWithPossession'].iloc[0] == df_kickoff['HomeTeam'].iloc[0]:
        df_kickoff_return.loc[0, 'TeamWithPossession'] = df_kickoff['AwayTeam'].iloc[0]
      else:
        df_kickoff_return.loc[0, 'TeamWithPossession'] = df_kickoff['HomeTeam'].iloc[0]
      df_kickoff_return = clean_run_plays(df_kickoff_return, kickoff_catch_spotting)
      df_kickoff_return['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]
      df_kickoff_return['PlayType'] = 'Kickoff Return'
      df_kickoff_return['Returner'] = df_kickoff_return['Rusher']
      df_kickoff_return['Rusher'] = 'nan'
      break

  #############
  #  PENALTY  #
  #############





  #############################
  # NEW REPLACEMENT DATAFRAME #
  #############################

  if df_kickoff_return['PlayDescription'].iloc[0] == 'nan':
    df_replacement_rows = df_kickoff
  elif df_kickoff['PlayDescription'].iloc[0] == 'nan': # Will happen during fumbled kickoff returns.
    df_replacement_rows = df_kickoff_return
  else:
    df_replacement_rows = pd.concat([df_kickoff, df_kickoff_return], ignore_index=True)

  df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
  df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
  df_plays = pd.concat([df_before_row, df_replacement_rows, df_after_row], ignore_index=True)

  if df_kickoff_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_kickoff_plays(df_plays, idx+len(df_replacement_rows))
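
The replacement step that closes `clean_kickoff_plays` (slice before, slice after, concatenate) recurs in every cleaner below. A minimal sketch of that splice as a standalone helper follows. `splice_rows` and the toy dataframes are hypothetical names used only for illustration; they are not part of this notebook.

```python
import pandas as pd

def splice_rows(df_plays, idx, df_replacement_rows):
    """Replace the row labelled `idx` with one or more replacement rows."""
    # Translate the index label into a position so iloc slicing works
    # even when the labels are not 0..n-1.
    position = df_plays.index.get_loc(idx)
    df_before_row = df_plays.iloc[:position]
    df_after_row = df_plays.iloc[position + 1:]
    # ignore_index=True renumbers the combined rows from 0.
    return pd.concat([df_before_row, df_replacement_rows, df_after_row],
                     ignore_index=True)

# Toy data (made up): the kickoff at label 1 becomes a kickoff row
# plus a return row.
df_toy = pd.DataFrame({'PlayOutcome': ['Run', 'Kickoff', 'Pass']})
df_new = pd.DataFrame({'PlayOutcome': ['Kickoff', 'Kickoff Return']})
df_result = splice_rows(df_toy, 1, df_new)
print(df_result['PlayOutcome'].tolist())
```

Because the result is renumbered, the play that followed the kickoff moves from label 2 to label `1 + len(df_new)`, which is why the recursive call restarts at `idx+len(df_replacement_rows)`.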

### SCORING CLEANING METHODS

#### TOUCHDOWNS

In [122]:
def clean_touchdown_plays(df_plays, index_start=None):

  # Cut 'df_plays' to begin from 'index_start' to the last touchdown play available in dataframe
  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_touchdown_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Touchdown')]
  else:
    df_touchdown_plays = df_plays[df_plays['PlayOutcome'].str.contains('Touchdown')]

  # Iterating through every touchdown play within 'df_touchdown_plays'
  for idx, play in df_touchdown_plays['PlayDescription'].items():


    ##########################
    # INTERCEPTED TOUCHDOWNS #
    ##########################

    # Still need to clean intercepted play types
    if play.find("INTERCEPTED") != -1:

      # print(idx)
      # print(play)

      # creating a copy of the intercepted touchdown play and cleaning the copy
      intercepted_touchdown_row = df_plays.loc[idx].copy()
      intercepted_touchdown_row['PlayOutcome'] = 'Interception'
      intercepted_touchdown_row['IsScoringPlay'] = 1 # This will only be the value for the team that threw the interception
      intercepted_touchdown_row = pd.DataFrame([intercepted_touchdown_row], columns=df_plays.columns)
      intercepted_touchdown_row.reset_index(drop=True, inplace=True)

      # REMINDER: This single play is separated into multiple actions (play will be represented with multiple rows)
      intercepted_touchdown_row = clean_intercepted_plays(intercepted_touchdown_row)
      intercepted_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, intercepted_touchdown_row, df_after_row], ignore_index=True)

      # print()

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(intercepted_touchdown_row))

    #####################################
    # SACKED FUMBLE RECOVERY TOUCHDOWNS #
    #####################################

    ######################
    # PASSING TOUCHDOWNS #
    ######################

    # If a play has a passer throwing the ball, I am assuming it is a passing play
    passing_play = re.findall(passer_name_pattern, play)
    # - I need to adjust this in the future. I will take out:
    #   1. sack check
    #   2. interception check
    if len(passing_play) > 0 and play.find("sacked") == -1 and play.find("INTERCEPTED") == -1:
      # print(idx)
      # print(play)

      # creating a copy of the passing touchdown play row and cleaning the copy
      passing_touchdown_row = df_plays.loc[idx].copy()
      passing_touchdown_row['PlayType'] = 'Pass'
      passing_touchdown_row['PlayOutcome'] = 'Pass'
      passing_touchdown_row['IsScoringPlay'] = 1
      passing_touchdown_row = pd.DataFrame([passing_touchdown_row], columns=df_plays.columns)
      passing_touchdown_row = clean_pass_plays(passing_touchdown_row)
      passing_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, passing_touchdown_row, df_after_row], ignore_index=True)

      # print()

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(passing_touchdown_row))

    ######################
    # RUSHING TOUCHDOWNS #
    ######################

    rushing_play = re.findall(rusher_pattern, play)
    if rushing_play:

      # print(idx)
      # print(play)

      # creating a copy of the rushing touchdown play row and cleaning the copy
      rushing_touchdown_row = df_plays.loc[idx].copy()
      rushing_touchdown_row['PlayType'] = 'Run'
      rushing_touchdown_row['PlayOutcome'] = 'Run'
      rushing_touchdown_row['IsScoringPlay'] = 1
      rushing_touchdown_row = pd.DataFrame([rushing_touchdown_row], columns=df_plays.columns)
      rushing_touchdown_row = clean_run_plays(rushing_touchdown_row)
      rushing_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, rushing_touchdown_row, df_after_row], ignore_index=True)

      # print()

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(rushing_touchdown_row))

    ##########################
    # PUNT RETURN TOUCHDOWNS #
    ##########################

    punt_play = re.findall(punting_pattern, play)
    if punt_play:

      # print(idx)
      # print(play)

      # creating a copy of the punt touchdown play and cleaning the copy
      punt_touchdown_row = df_plays.loc[idx].copy()
      punt_touchdown_row['PlayOutcome'] = 'Punt'
      punt_touchdown_row['IsScoringPlay'] = 1 # This will only be the value for the team that punted the ball
      punt_touchdown_row = pd.DataFrame([punt_touchdown_row], columns=df_plays.columns)
      punt_touchdown_row.reset_index(drop=True, inplace=True)

      # REMINDER: This single play is separated into multiple actions (play will be represented with multiple rows)
      punt_touchdown_row = clean_punt_plays(punt_touchdown_row)
      punt_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, punt_touchdown_row, df_after_row], ignore_index=True)

      # print()

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(punt_touchdown_row))

    #################################
    # BLOCKED FIELD GOAL TOUCHDOWNS #
    #################################

  return df_plays
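
`clean_touchdown_plays` tries its branches in a fixed order (interception, pass, rush, punt return), and the first one that matches wins. The sketch below isolates that ordering. The three regular expressions are hypothetical stand-ins for `passer_name_pattern`, `rusher_pattern` and `punting_pattern`, which are defined elsewhere in the notebook and may differ, and the play descriptions are made up.

```python
import re

# Hypothetical stand-ins for the notebook's real patterns.
PASSER = re.compile(r"([A-Z]\.[A-Za-z'-]+) pass\b")
RUSHER = re.compile(r"([A-Z]\.[A-Za-z'-]+) (?:up the middle|left|right|scrambles)")
PUNTER = re.compile(r"([A-Z]\.[A-Za-z'-]+) punts\b")

def classify_touchdown(play):
    """Return the name of the first branch that would handle `play`."""
    if "INTERCEPTED" in play:
        return "Interception"
    if PASSER.search(play) and "sacked" not in play:
        return "Pass"
    if RUSHER.search(play):
        return "Run"
    if PUNTER.search(play):
        return "Punt"
    return None  # no branch matched; the row would be left untouched

# Made-up descriptions in the general style of the scraped data.
pass_td = "(Shotgun) A.Smith pass short right to B.Jones for 9 yards, TOUCHDOWN."
pick_six = "A.Smith pass short left intended for B.Jones INTERCEPTED by C.Brown. C.Brown for 30 yards, TOUCHDOWN."
rush_td = "D.Green up the middle for 3 yards, TOUCHDOWN."
punt_td = "E.White punts 50 yards to XYZ 20, Center-F.Black. G.Gray for 80 yards, TOUCHDOWN."

for play in (pass_td, pick_six, rush_td, punt_td):
    print(classify_touchdown(play))
```

The interception check has to come first here, because an intercepted pass also matches the passer pattern.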

#### FIELD GOALS

In [123]:
# I need an example of when a player returns the field goal for yardage
# I need a larger sample size for "Blocked" field goals
# I need to figure out what to do if someone fumbles a recovery
# I need to figure out what to do on a trick play (e.g. holder runs out with the ball)
# - INCOMPLETE. NEED LARGER SAMPLE SIZE

def clean_field_goal_plays(df_plays, index_start = None):

  # Adjusting df_plays to start cleaning at a specified index (index_start)
  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
    # Locating all field goal plays within dataframe
    df_field_goal_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Field Goal')]
  else:
    # Locating all field goal plays within dataframe
    df_field_goal_plays = df_plays[df_plays['PlayOutcome'].str.contains('Field Goal')]

  for idx, play in df_field_goal_plays['PlayDescription'].items():

    # print(idx)
    # print(play)

    ###################
    # EXTRA PLAY DATA #
    ###################

    #########################
    # FIELD GOAL SITUATIONS #
    #########################

    field_goal = re.findall(field_goal_pattern, play)
    if field_goal:
      # print(field_goal)
      df_plays.loc[idx, 'PlayType'] = 'Field Goal'
      df_plays.loc[idx, 'Kicker'] = field_goal[0][0]
      df_plays.loc[idx, 'Yardage'] = int(field_goal[0][1])
      # Element will only fill if the field goal was no good
      if field_goal[0][2] != '':
        df_plays.loc[idx, 'Direction'] = field_goal[0][2]
      df_plays.loc[idx, 'LongSnapper'] = field_goal[0][3]
      df_plays.loc[idx, 'Holder'] = field_goal[0][4]

    ######################
    # FIELD GOAL BLOCKED #
    ######################

    field_goal_blocked = re.findall(field_goal_blocked_pattern, play)
    if field_goal_blocked:
      # print(idx)
      # print(play)
      # print(field_goal_blocked)

      # Because of the potential recovery for yardage after a blocked field
      # goal, I need 2 dataframes:
      # 1. intended field goal attempt (containing blocked details)
      df_blocked_fg = df_plays.loc[idx].copy()
      df_blocked_fg = pd.DataFrame([df_blocked_fg], columns=df_plays.columns)
      df_blocked_fg.reset_index(drop=True, inplace=True)
      df_blocked_fg['PlayDescription'] = 'nan'
      # 2. yardage after recovery
      df_yardage_after_recovery = df_plays.loc[idx].copy()
      df_yardage_after_recovery = pd.DataFrame([df_yardage_after_recovery], columns=df_plays.columns)
      df_yardage_after_recovery.reset_index(drop=True, inplace=True)
      df_yardage_after_recovery['PlayDescription'] = 'nan'

      # I need to separate the play description into 2 groups:
      # 1. All actions involving the field goal and field goal block
      #    - for df_blocked_fg
      # 2. All actions following the field goal block recovery for yardage
      #    - for df_yardage_after_recovery
      play_elements = split_play_description(play)
      for i in play_elements:
        if i.lower().find('blocked') != -1:
          blocked_fg_playdescription = ". ".join(play_elements[:play_elements.index(i)+1])
          yardage_after_recovery_playdescription = ". ".join(play_elements[play_elements.index(i)+1:])
          # print(blocked_fg_playdescription)
          # print(yardage_after_recovery_playdescription)

      ######################
      # BLOCKED FIELD GOAL #
      ######################

      df_blocked_fg['PlayDescription'] = blocked_fg_playdescription
      df_blocked_fg['PlayType'] = 'Field Goal'
      blocked_fg_data = re.findall(field_goal_blocked_pattern, blocked_fg_playdescription)
      df_blocked_fg['Kicker'] = blocked_fg_data[0][0]
      df_blocked_fg['Yardage'] = int(blocked_fg_data[0][1])
      df_blocked_fg['BlockedBy'] = blocked_fg_data[0][2]
      df_blocked_fg['LongSnapper'] = blocked_fg_data[0][3]
      df_blocked_fg['Holder'] = blocked_fg_data[0][4]

      ###################################
      # FIELD GOAL RECOVERY FOR YARDAGE #
      ###################################

      # If there was a recovery for yardage
      if re.findall(standard_play_end_pattern, yardage_after_recovery_playdescription):
        df_yardage_after_recovery['PlayDescription'] = yardage_after_recovery_playdescription
        df_yardage_after_recovery['PlayOutcome'] = 'Run'
        # blocked_fg_data[0][5] - team of the player that recovered the ball
        df_yardage_after_recovery['TeamWithPossession'] = blocked_fg_data[0][5]
        # blocked_fg_data[0][7] - recovery spotting of player who recovered ball
        df_yardage_after_recovery = clean_run_plays(df_yardage_after_recovery, blocked_fg_data[0][7])
        df_yardage_after_recovery['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]
        df_yardage_after_recovery['PlayType'] = 'Field Goal Return'

      #################
      # NEW DATAFRAME #
      #################
      # - If there was a recovery for yardage, there will be multiple rows
      #   within the dataframe of plays that represent the blocked field goal
      #   along with actions that followed.

      if df_yardage_after_recovery['PlayDescription'].iloc[0] == 'nan':
        df_replacement_rows = df_blocked_fg
      else:
        df_replacement_rows = pd.concat([df_blocked_fg, df_yardage_after_recovery], ignore_index=True)

      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, df_replacement_rows, df_after_row], ignore_index=True)

      if df_field_goal_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_field_goal_plays(df_plays, idx+len(df_replacement_rows))

      # print()

  return df_plays
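
The field goal branch relies on `re.findall` returning one tuple per match, with an empty string for any group that did not take part. That is why `Direction` is only filled when the third element is not `''`. The sketch below shows this behaviour with a hypothetical pattern and made-up descriptions. The notebook's real `field_goal_pattern` is defined elsewhere and may differ.

```python
import re

# Hypothetical pattern, for illustration only.
# Groups: kicker, yardage, direction (misses only), long snapper, holder.
field_goal_sketch = re.compile(
    r"([A-Z]\.[A-Za-z'-]+) (\d+) yard field goal is "
    r"(?:GOOD|No Good, ([A-Za-z ]+?)), "
    r"Center-([A-Z]\.[A-Za-z'-]+), Holder-([A-Z]\.[A-Za-z'-]+)"
)

# Made-up play descriptions.
made = "(0:03) A.Kicker 45 yard field goal is GOOD, Center-B.Snapper, Holder-C.Holder."
missed = "(0:03) A.Kicker 52 yard field goal is No Good, Wide Right, Center-B.Snapper, Holder-C.Holder."

made_groups = field_goal_sketch.findall(made)
missed_groups = field_goal_sketch.findall(missed)

print(made_groups)    # direction group is '' on a made kick
print(missed_groups)  # direction group is filled on a miss
```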

#### EXTRA POINTS

In [166]:
def clean_extra_point_plays(df_plays, index_start = None):

  # Adjusting df_plays to start cleaning at a specified index (index_start)
  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
    # Locating all extra point plays within dataframe
    df_extra_point_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Extra Point')]
  else:
    # Locating all extra point plays within dataframe
    df_extra_point_plays = df_plays[df_plays['PlayOutcome'].str.contains('Extra Point')]

  for idx, play in df_extra_point_plays['PlayDescription'].items():

    # print(idx)
    # print(play)

    ###################
    # EXTRA PLAY DATA #
    ###################


    ##########################
    # EXTRA POINT SITUATIONS #
    ##########################

    extra_point = re.findall(extra_point_pattern, play)
    if extra_point:
      # print(extra_point)
      df_plays.loc[idx, 'PlayType'] = 'Extra Point'
      df_plays.loc[idx, 'Kicker'] = extra_point[0][0]
      if extra_point[0][1] != '':
        df_plays.loc[idx, 'Direction'] = extra_point[0][1]
      df_plays.loc[idx, 'LongSnapper'] = extra_point[0][2]
      df_plays.loc[idx, 'Holder'] = extra_point[0][3]

    # print()

  return df_plays

### OTHER CLEANING METHODS

#### FUMBLES

In [125]:
# What about punt returns?

def clean_fumble_plays(df_plays, index_start=None):

  # Cut 'df_plays' to begin from 'index_start' to the last fumble play available in dataframe
  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_fumble_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('fumble', case=False)]
  else:
    df_fumble_plays = df_plays[df_plays['PlayOutcome'].str.contains('fumble', case=False)]

  for idx, play in df_fumble_plays['PlayDescription'].items():

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = split_play_description(play)
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = split_play_description(play)
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    initial_action = split_play_description(play)[0]

    # print(idx)
    # print(play)
    # print(initial_action)

    ##################
    # PASSING FUMBLE #
    ##################

    fumble_pass = re.findall(receiver_pattern, initial_action)
    if fumble_pass:

      # creating a copy of the passing fumbled play row and cleaning the copy
      passing_fumble_row = df_plays.loc[idx].copy()
      passing_fumble_row['PlayOutcome'] = 'Pass'
      passing_fumble_row = pd.DataFrame([passing_fumble_row], columns=df_plays.columns)
      passing_fumble_row = clean_pass_plays(passing_fumble_row)

      # Record whether the pass was complete or incomplete.
      if play.find('pass incomplete') != -1:
        passing_fumble_row['PlayOutcome'] = f"{df_plays['PlayOutcome'].loc[idx]} (I)"
      else:
        passing_fumble_row['PlayOutcome'] = f"{df_plays['PlayOutcome'].loc[idx]} (C)"

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, passing_fumble_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_fumble_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_fumble_plays(df_plays, idx+len(passing_fumble_row))

    ##################
    # RUSHING FUMBLE #
    ##################

    fumble_rush = re.findall(rusher_pattern, initial_action)
    fumble_aborted = initial_action.find('Aborted')
    if fumble_rush or fumble_aborted != -1:

      # creating a copy of the rushing fumbled play row and cleaning the copy
      rushing_fumble_row = df_plays.loc[idx].copy()
      rushing_fumble_row['PlayOutcome'] = 'Run'
      rushing_fumble_row = pd.DataFrame([rushing_fumble_row], columns=df_plays.columns)
      rushing_fumble_row = clean_run_plays(rushing_fumble_row)
      rushing_fumble_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, rushing_fumble_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_fumble_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_fumble_plays(df_plays, idx+len(rushing_fumble_row))

    #################
    # SACKED FUMBLE #
    #################

    if initial_action.find('sacked') != -1:

      # creating a copy of the sacked fumble play row and cleaning the copy
      sacked_fumble_row = df_plays.loc[idx].copy()
      sacked_fumble_row['PlayOutcome'] = 'Sack'
      sacked_fumble_row = pd.DataFrame([sacked_fumble_row], columns=df_plays.columns)
      sacked_fumble_row = clean_sacked_plays(sacked_fumble_row)
      sacked_fumble_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, sacked_fumble_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_fumble_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_fumble_plays(df_plays, idx+len(sacked_fumble_row))

    ##################
    # KICKOFF FUMBLE #
    ##################

    kickoff_fumble = re.findall(kickoff_pattern, initial_action)
    if kickoff_fumble:

      # creating a copy of the kickoff fumbled play row and cleaning the copy
      kickoff_fumble_row = df_plays.loc[idx].copy()
      kickoff_fumble_row['PlayOutcome'] = 'Kickoff'
      kickoff_fumble_row = pd.DataFrame([kickoff_fumble_row], columns=df_plays.columns)
      kickoff_fumble_row = clean_kickoff_plays(kickoff_fumble_row)
      kickoff_fumble_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, kickoff_fumble_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_fumble_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_fumble_plays(df_plays, idx+len(kickoff_fumble_row))

    ###############
    # PUNT FUMBLE #
    ###############
    # - Need to figure out what to do for "muffed" catches

    punt_fumble = re.findall(punting_pattern, initial_action)
    if punt_fumble:

      # creating a copy of the fumbled play row and cleaning the copy
      punt_fumble_row = df_plays.loc[idx].copy()
      punt_fumble_row['PlayOutcome'] = 'Punt'
      punt_fumble_row = pd.DataFrame([punt_fumble_row], columns=df_plays.columns)
      punt_fumble_row = clean_punt_plays(punt_fumble_row)
      punt_fumble_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, punt_fumble_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_fumble_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_fumble_plays(df_plays, idx+len(punt_fumble_row))

    # print()

  return df_plays
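
The REVERSES step at the top of the loop keeps every sentence up to and including the "REVERSED" sentence as `ReverseDetails` and cleans only what follows. A minimal sketch of that split is below. `split_sentences` is a naive, hypothetical stand-in for `split_play_description` (defined elsewhere in the notebook), and the play description is made up.

```python
def split_sentences(play):
    # Hypothetical stand-in for split_play_description: a naive split on
    # ". " that works here because names like "A.Runner" have no space
    # after the period.
    return [s for s in play.split(". ") if s]

def trim_reversed(play):
    """Return (reverse_details, remaining_play) for a play description."""
    elements = split_sentences(play)
    # enumerate gives the position directly, so repeated sentences
    # cannot confuse a later list.index() lookup.
    for position, sentence in enumerate(elements):
        if "REVERSED" in sentence:
            details = elements[:position + 1]
            remaining = ". ".join(elements[position + 1:])
            return details, remaining
    return [], play  # nothing was reversed; clean the whole description

# Made-up description in the general style of the scraped data.
play = ("A.Runner left end to XYZ 20 for 5 yards. "
        "FUMBLES, recovered by ABC-B.Def at XYZ 20. "
        "The Replay Official reviewed the fumble ruling, and the play was REVERSED. "
        "A.Runner left end to XYZ 20 for 5 yards (B.Def).")

details, remaining = trim_reversed(play)
print(len(details))
print(remaining)
```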

#### PENALTIES

In [126]:
# Will work on after fumbles.

#### TURNOVER ON DOWNS

In [127]:
# Looks like either a pass / run / sack play

def clean_turnover_on_downs_plays(df_plays, index_start=None):

  # Cut 'df_plays' to begin from 'index_start' to the last turnover on downs play available in dataframe
  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_turnover_on_downs_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Turnover on Downs', case=False)]
  else:
    df_turnover_on_downs_plays = df_plays[df_plays['PlayOutcome'].str.contains('Turnover on Downs', case=False)]

  # Iterating through every turnover on downs play within 'df_turnover_on_downs_plays'
  for idx, play in df_turnover_on_downs_plays['PlayDescription'].items():

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = split_play_description(play)
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ##############################
    # TURNOVER ON DOWNS (SACKED) #
    ##############################

    if play.find("sacked") != -1:

      sacked_turnover_on_downs = df_plays.loc[idx].copy()
      sacked_turnover_on_downs['PlayOutcome'] = 'Sack'
      sacked_turnover_on_downs = pd.DataFrame([sacked_turnover_on_downs], columns=df_plays.columns)
      sacked_turnover_on_downs.reset_index(drop=True, inplace=True)
      sacked_turnover_on_downs = clean_sacked_plays(sacked_turnover_on_downs)
      sacked_turnover_on_downs['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, sacked_turnover_on_downs, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_turnover_on_downs_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_turnover_on_downs_plays(df_plays, idx+len(sacked_turnover_on_downs))

    ############################
    # TURNOVER ON DOWNS (PASS) #
    ############################

    passing_play = re.findall(passer_name_pattern, play)
    if passing_play:

      passing_turnover_on_downs = df_plays.loc[idx].copy()
      passing_turnover_on_downs['PlayOutcome'] = 'Pass'
      passing_turnover_on_downs = pd.DataFrame([passing_turnover_on_downs], columns=df_plays.columns)
      passing_turnover_on_downs = clean_pass_plays(passing_turnover_on_downs)

      # Record whether the pass was complete or incomplete.
      if play.find('pass incomplete') != -1:
        passing_turnover_on_downs['PlayOutcome'] = f"{df_plays['PlayOutcome'].loc[idx]} (I)"
      else:
        passing_turnover_on_downs['PlayOutcome'] = f"{df_plays['PlayOutcome'].loc[idx]} (C)"

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, passing_turnover_on_downs, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_turnover_on_downs_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_turnover_on_downs_plays(df_plays, idx+len(passing_turnover_on_downs))

    ############################
    # TURNOVER ON DOWNS (RUSH) #
    ############################

    rushing_play = re.findall(rusher_pattern, play)
    if len(rushing_play) > 0:

      rushing_turnover_on_downs = df_plays.loc[idx].copy()
      rushing_turnover_on_downs['PlayOutcome'] = 'Run'
      rushing_turnover_on_downs = pd.DataFrame([rushing_turnover_on_downs], columns=df_plays.columns)
      rushing_turnover_on_downs = clean_run_plays(rushing_turnover_on_downs)
      rushing_turnover_on_downs['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, rushing_turnover_on_downs, df_after_row], ignore_index=True)

      if df_turnover_on_downs_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_turnover_on_downs_plays(df_plays, idx+len(rushing_turnover_on_downs))

  # Reached when there are no turnover on downs plays or no branch matched
  return df_plays

## 5. PIPELINE MAIN METHOD

In [128]:
# PURPOSE:
# - Accept a dataframe of nfl plays (formatted by NFL_Scrapers) and
#   return a cleaned dataframe of those plays.
# INPUT PARAMETERS:
# df_all_plays         - dataframe - all plays in raw form from NFL_Scraper that user
#                                    would like to clean.
# OUTPUT:
# df_all_plays_cleaned - dataframe - all plays from 'df_all_plays' cleaned and data
#                                    dispersed into individual new features.

# CURRENT DESIGN PLAN:
# 1. Use uniquely designed methods for each play type to clean within dataframe
#    - (e.g. pass, run, touchdown, punt, sack, ... )
# 2. Repeat until all plays within dataframe have been cleaned.
#   NOTE:
#   - It is important to fully clean a play type before moving to the next
#      because sometimes cleaning could involve adding a new row to the dataframe,
#      causing a reset to the dataframe's indexing.
#      - If we were to separate all play types from the beginning, the indexes
#        could shift around causing, for example, an index that might originally
#        point to a run play to now instead point at a pass play.

def clean_dataframe_of_plays(df_all_plays):

  # Return Dataframe
  df_all_plays_cleaned = df_all_plays.copy()

  ################################
  # RAW DATA COLUMN DESCRIPTIONS #
  ################################

  # Season             - Year of the season
  # Week               - Game week of the season (e.g. 'Week 1')
  # Day                - Day of the week (e.g. 'MON')
  # Date               - Month and day of the game formatted MM/DD (e.g. '09/07')
  # AwayTeam           - Visiting team of the game
  # HomeTeam           - Home team of the game
  # Quarter            - Quarter that the play is in
  #                      - NOT ACCURATE. Drives that go between quarters will end up
  #                        having all plays in the later quarter.
  # DriveNumber        - Drive number of the quarter that the play is in
  # TeamWithPossession - Team that started with the ball at the beginning of the play.
  # IsScoringDrive     - Does the drive that contains the focused play result in a score?
  # PlayNumberInDrive  - Play count in the drive
  # IsScoringPlay      - Did the play result in a score?
  # PlayOutcome        - Ultimate result of the play (e.g. '13 Yard Pass')
  # PlayStart          - The down and where the play started on the field (e.g. '2nd & 9 at DET 21')
  # PlayTimeFormation  - Time left in the quarter / quarter / play formation
  # PlayDescription    - The raw description given of the focused play, entailing everything
  #                      that happened within it.

  #############################################################
  # TRANSFORMING FEATURE VALUES (PREPPING DATA TO BE CLEANED) #
  #############################################################
  df_all_plays_cleaned = playtimeformation_split(df_all_plays_cleaned)
  df_all_plays_cleaned = playstart_split(df_all_plays_cleaned)
  df_all_plays_cleaned = consistent_team_names(df_all_plays_cleaned)

  ######################################
  # NEW ADDITIONAL COLUMN DESCRIPTIONS #
  ######################################

  # ~ General features ~
  # TimeOnTheClock     - NOT HERE ANYMORE.

  # ~ Offensive features ~
  # EndSpot            - Where the end of the play has been spotted
  #                      - This can also be where the end of the action within a play has been spotted.
  # PlayType           - The type of play (e.g. pass/run)
  # Formation          - Play formation
  # Passer             - Player that threw the ball (mostly the quarterback)
  # Rusher             - Player that ran the ball (mostly the running back)
  # Receiver           - Player on the same team as the passer that caught the ball
  # Direction          - Where the ball is going during the play
  # Yardage            - Yards gained during the play
  #                      - (Should specify that yardage does not include extra yardage gained from penalties)
  #                      - (Player awarded yardage)
  #                      - (also includes how far kicks have gone during kickoffs and punts)

  # ~ Defensive features ~
  # SoloTackle         - Player awarded a solo tackle from a play
  # AssistedTackle     - Player awarded an assisted tackle from a play
  # SharedTackle       - Player awarded a shared tackle from a play
  # PassDefendedBy     - Defender that defended the passing play
  # PressureBy         - Defender that applied pressure to the passer
  # InterceptedBy      - Defender that intercepted the passing play
  # SackedBy           - Player awarded a sack from a play. (Could be solo or split)
  # ForcedFumbleBy     - Player awarded a forced fumble from a play

  # ~ Unique features (uncommon) ~
  # WhoFumbled         - Player who last held the ball during a fumble.
  # FumbleRecoveredBy  - Player who recovered the fumbled ball
  # FumbleDetails      - A list that has what happened after the fumble
  #                      - [forced fumble by, recovered by, yards gained, tackled by]
  # ReverseDetails     - A list having plays leading up to play reversal
  # InjuredPlayers     - Players that were injured during the play
  # AcceptedPenalty    - Penalty on the field that was accepted
  # DeclinedPenalty    - Penalty on the field that was declined

  # ~ Special teams features ~
  # Kicker             - Player who kicked the ball during a kickoff / punt / extra point / field goal
  # LongSnapper        - Player who snapped the ball during a punt / extra point / field goal
  # Returner           - Player who returned the ball during a kickoff / punt
  # DownedBy           - Player on the kicking team who downed the ball (e.g. on a punt)
  # Holder             - Player who held ball for extra point / field goal
  # BlockedBy          - Player who blocked a punt / extra point / field goal

  new_columns = ["EndSpot",
                 "PlayType", "Passer", "Rusher", "Receiver", "Direction", "Yardage",
                 "SoloTackle", "AssistedTackle", "SharedTackle", 'PassDefendedBy', "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
                 "WhoFumbled", "FumbleRecoveredBy", "FumbleDetails", "ReverseDetails", "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
                 "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

  string_columns = ["EndSpot",
                    "PlayType", "Passer", "Rusher", "Receiver", "Direction",
                    "SoloTackle", "AssistedTackle", "SharedTackle", 'PassDefendedBy', "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
                    "WhoFumbled", "FumbleRecoveredBy", "FumbleDetails", "ReverseDetails", "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
                    "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

  int_columns = ["Yardage"]

  ########################################
  # RETURN DATAFRAME WITH ADDED FEATURES #
  ########################################

  df_all_plays_cleaned = df_all_plays_cleaned.reindex(columns=df_all_plays_cleaned.columns.tolist() + new_columns)
  df_all_plays_cleaned[string_columns] = df_all_plays_cleaned[string_columns].astype(str)
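  # Yardage is cast to float rather than int because the new column starts out as NaN,
  # which NumPy integer dtypes cannot hold.
  df_all_plays_cleaned[int_columns] = df_all_plays_cleaned[int_columns].astype(float)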

  ########################################
  # GETTING PLAY CATEGORIES AND CLEANING #
  ########################################
  # TOUCHDOWNS MUST BE CLEANED FIRST
  # - Any touchdown resulting from a change in possession (e.g. Interception for Touchdown)
  #   raw data states that the team on defense had possession the entire drive.
  #   - So all plays leading up to the touchdown state that the defense has possession.
  df_all_plays_cleaned = clean_touchdown_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_pass_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_run_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_2pt_conversion_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_intercepted_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_sacked_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_punt_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_kickoff_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_field_goal_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_extra_point_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_turnover_on_downs_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_fumble_plays(df_all_plays_cleaned)



  return df_all_plays_cleaned
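
As a minimal sketch of the column setup used in `clean_dataframe_of_plays`, the toy example below appends empty columns with `reindex` and casts the numeric one to float. The two-row frame and the shortened column list are made up for illustration; only the `reindex` pattern is taken from the function above.

```python
import pandas as pd

# Toy stand-in for the scraped plays (hypothetical values, not real data)
toy_plays = pd.DataFrame({"Quarter": [1, 1], "DriveNumber": [1, 2]})

# Shortened, illustrative version of the new_columns list
toy_new_columns = ["PlayType", "Passer", "Yardage"]

# reindex keeps the existing columns and appends the new ones, filled with NaN
toy_plays = toy_plays.reindex(columns=toy_plays.columns.tolist() + toy_new_columns)

# NaN cannot be stored in a NumPy integer column, so the numeric column is kept as float
toy_plays[["Yardage"]] = toy_plays[["Yardage"]].astype(float)

print(toy_plays.columns.tolist())
print(toy_plays["Yardage"].dtype)
```

Every appended column starts out entirely missing, which is why the cleaning functions that follow can fill them in play by play.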

# TESTING

In [169]:
df_week2_plays_cleaned = clean_dataframe_of_plays(week2_2023_plays)

842
-----------------------
- 1. PLAY DESCRIPTION -
-----------------------
— A.Richardson pass short left to M.Pittman to IND 37 for 12 yards (M.Stewart) [W.Anderson]. FUMBLES (M.Stewart), ball out of bounds at IND 42.

-----------------------------
- 2. PLAY DESCRIPTION SPLIT -
-----------------------------
— A.Richardson pass short left to M.Pittman to IND 37 for 12 yards (M.Stewart) [W.Anderson]
FUMBLES (M.Stewart), ball out of bounds at IND 42.

------------------------------
- 3. PLAYDESCRIPTION GROUPED -
------------------------------
['— A.Richardson pass short left to M.Pittman to IND 37 for 12 yards (M.Stewart) [W.Anderson]', 'FUMBLES (M.Stewart), ball out of bounds at IND 42.']


963
-----------------------
- 1. PLAY DESCRIPTION -
-----------------------
— J.Patterson to HOU 36 for -5 yards. FUMBLES, recovered by HOU-C.Stroud at HOU 35. C.Stroud pass short right to R.Woods to HOU 49 for 8 yards (E.Speed).

-----------------------------
- 2. PLAY DESCRIPTION SPLIT -
---------

In [130]:
df_week2_plays_cleaned.shape

(2835, 46)

In [131]:
df_week2_plays_cleaned

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,...,ReverseDetails,InjuredPlayers,AcceptedPenalty,DeclinedPenalty,Kicker,LongSnapper,Returner,DownedBy,Holder,BlockedBy
0,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,,,,G.Joseph,,,,,
1,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,,,,,,,,,
2,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,,,,,,,,,
3,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,[J.Metellus],,,,,,,,
4,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2830,2023,Week 2,MON,09/18,CLE,PIT,4,6,CLE,0,...,,,,,,,,,,
2831,2023,Week 2,MON,09/18,CLE,PIT,4,6,CLE,0,...,,,,,,,,,,
2832,2023,Week 2,MON,09/18,CLE,PIT,4,6,CLE,0,...,,,,,,,,,,
2833,2023,Week 2,MON,09/18,CLE,PIT,4,7,PIT,0,...,,,,,,,,,,


# PLAYTYPE OBSERVATIONS

In [132]:
# Modifying plays to match cleaned plays transformed features
# ( e.g. Quarter(original)  = '1st Quarter'
#        Quarter(transform) = 1 )
# - This is needed in order to match plays from the original dataframe
#   to the cleaned dataframe.
df_week2_plays_modified = week2_2023_plays.copy()
df_week2_plays_modified = playtimeformation_split(df_week2_plays_modified)
df_week2_plays_modified = playstart_split(df_week2_plays_modified)
df_week2_plays_modified = consistent_team_names(df_week2_plays_modified)

## HELPER METHOD

In [133]:
# PURPOSE:
# - A tool that can be used to compare original plays and their cleaned versions

# I would like to return a map that has:
# KEY: index of original unclean play
# VALUE: index(es) of cleaned play

def unclean_to_clean_play_matches(df_unclean_plays, df_clean_plays):

  my_map = {}

  # This list of features is unique to each play
  # - Both the unclean and cleaned versions of the plays have these same features, therefore
  #   they will be used to match unclean plays in 'df_unclean_plays' to clean plays in 'df_clean_plays'
  matching_features = ['Season', 'Week', 'Date', 'AwayTeam', 'HomeTeam', 'Quarter', 'DriveNumber', 'PlayNumberInDrive']

  # Iterate through each row of the unclean plays dataframe
  for u_row in df_unclean_plays.itertuples(index=True):
    u_features = [getattr(u_row, col) for col in matching_features]

    matching_indexes = []
    matches_found = False

    # Iterate through each row of the dataframe of cleaned plays
    # - The starting index will be the index of the unclean play within the main original dataframe of plays
    #   - The matching cleaned pair will either be at the exact same location or higher
    for c_row in df_clean_plays[u_row.Index::].itertuples(index=True):
      c_features = [getattr(c_row, col) for col in matching_features]

      # If a match is found, check for consecutive rows of matches because some uncleaned plays needed to be cleaned using multiple rows
      # - Once a row that does not match follows one that does, break the loop because the one play match has been found.
      if u_features == c_features:
        matching_indexes.append(c_row.Index)
        matches_found = True
      elif matches_found:
        break

    # Record the match after the inner loop so that a play whose cleaned rows run to the
    # end of the dataframe (no trailing non-matching row) is still captured.
    if matches_found:
      my_map[u_row.Index] = matching_indexes

  return my_map
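
The row-by-row search above relies on the cleaned rows sitting at or after the position of the unclean row. As an alternative sketch (not the method used in this notebook), the same mapping can be built with a pandas merge on the key columns. The function name `match_plays_by_merge`, the two-column key list, and the toy frames are all hypothetical.

```python
import pandas as pd

def match_plays_by_merge(df_unclean, df_clean, keys):
    """Map each unclean row label to the cleaned row labels that share its key values."""
    # Move the row labels into ordinary columns so they survive the merge
    left = df_unclean[keys].rename_axis("unclean_idx").reset_index()
    right = df_clean[keys].rename_axis("clean_idx").reset_index()
    merged = left.merge(right, on=keys, how="inner")
    grouped = merged.groupby("unclean_idx")["clean_idx"].apply(list)
    # Convert to plain Python ints so the result prints like the original map
    return {int(k): sorted(int(v) for v in vals) for k, vals in grouped.items()}

# Shortened key list for the toy example; the notebook uses eight matching features
toy_keys = ["DriveNumber", "PlayNumberInDrive"]
toy_unclean = pd.DataFrame({"DriveNumber": [1, 1, 2], "PlayNumberInDrive": [1, 2, 1]})
# The second play was split into two cleaned rows
toy_clean = pd.DataFrame({"DriveNumber": [1, 1, 1, 2], "PlayNumberInDrive": [1, 2, 2, 1]})

toy_matches = match_plays_by_merge(toy_unclean, toy_clean, toy_keys)
print(toy_matches)
```

A merge does not depend on row order, but unlike the loop it will also pair rows that are not adjacent, so it only gives the same answer when the key columns really are unique per play.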

## OFFENSIVE PLAYS

### PASSING PLAYS

In [134]:
# All passing plays
df_unclean_pass_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Pass'))]

map_unclean_clean_pass_plays = unclean_to_clean_play_matches(df_unclean_pass_plays, df_week2_plays_cleaned)

len(map_unclean_clean_pass_plays.keys())

# # All passing plays to 'A.St. Brown'
# # - I need to figure out how to separate each sentence of the play description. Currently I am splitting them
# #   on the characters ". ". This will not always work and can cause errors, because some players have
# #   names that contain ". ", which makes the split land inside the name.
# df_unclean_pass_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Pass')) &
#                                                     (df_week2_plays_modified['PlayDescription'].str.contains('A.St. Brown', case=False))]

# map_unclean_clean_pass_plays = unclean_to_clean_play_matches(df_unclean_pass_plays, df_week2_plays_cleaned)

# len(map_unclean_clean_pass_plays.keys())

1045
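
The note above about names such as `A.St. Brown` can be illustrated with a small regex heuristic. This is only a sketch and not the notebook's `split_play_description`: it splits on ". " unless the period closes an abbreviated name part shaped like `A.St`. The pattern, the function name `split_sentences_sketch`, and the second description (written by hand to exercise the edge case) are assumptions.

```python
import re

# Split on ". " unless the four characters before the period look like "A.St"
# (capital initial, period, capital letter, lowercase letter). Heuristic, not exhaustive.
SENTENCE_BREAK = re.compile(r"(?<![A-Z]\.[A-Z][a-z])\. ")

def split_sentences_sketch(description):
    """Split a play description into sentences without breaking names like 'A.St. Brown'."""
    return SENTENCE_BREAK.split(description)

# Description taken from the debug output earlier in this notebook (leading dash dropped)
real_play = ("A.Richardson pass short left to M.Pittman to IND 37 for 12 yards "
             "(M.Stewart) [W.Anderson]. FUMBLES (M.Stewart), ball out of bounds at IND 42.")

# Hand-written description, for illustration only
invented_play = ("J.Goff pass short left to A.St. Brown to DET 30 for 8 yards (T.Woolen). "
                 "DET-A.St. Brown was injured during the play.")

real_parts = split_sentences_sketch(real_play)
invented_parts = split_sentences_sketch(invented_play)
naive_parts = invented_play.split(". ")

for part in invented_parts:
    print(part)
```

The naive `split(". ")` cuts the invented description inside both occurrences of the name, while the lookbehind keeps each sentence whole.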

In [135]:
# Every unclean passing play and their associated cleaned play breakdown

for i in map_unclean_clean_pass_plays.keys():
  print(f"({i}, {map_unclean_clean_pass_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  # play_split = play.split(". ")
  play_split = split_play_description(play)
  for j in play_split:
    print(j)
  print()

(1, [1])
— J.Hurts pass short right to D.Smith to PHI 31 for 6 yards (B.Murphy; C.Bynum).

(4, [4])
— J.Hurts pass short left to D.Goedert to PHI 36 for -1 yards (C.Bynum).

(5, [5])
— J.Hurts pass deep right to D.Smith to MIN 10 for 54 yards (Th.Jackson).

(6, [6])
— J.Hurts pass short left to A.Brown to MIN 9 for 1 yard (Th.Jackson, C.Bynum).

(11, [11])
— K.Cousins pass deep right to J.Jefferson to MIN 40 for 15 yards (D.Slay).

(12, [12])
— K.Cousins pass short right to J.Jefferson to MIN 41 for 1 yard (D.Slay).

(13, [13])
— K.Cousins pass incomplete short right to A.Mattison [J.Sweat].

(14, [14])
— K.Cousins pass incomplete short middle to K.Osborn [J.Davis].

(16, [17])
— J.Hurts pass incomplete deep right to D.Goedert.

(17, [18])
— J.Hurts pass short right to D.Goedert to PHI 16 for 6 yards (H.Smith).

(25, [27])
— J.Hurts pass short left to D.Goedert to PHI 44 for 1 yard (Th.Jackson, A.Evans).

(29, [32])
— J.Hurts pass short left to A.Brown to MIN 34 for 5 yards (J.Hicks).
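
The same print loop is repeated for every play category in this section. One way to avoid the repetition is a small helper that builds the lines once; the sketch below assumes a hypothetical name `format_play_breakdown` and takes the splitting function as a parameter. The toy frame reuses two descriptions from this notebook with made-up row labels.

```python
import pandas as pd

def format_play_breakdown(play_map, df_plays, splitter):
    """Return printable lines for each (unclean index, cleaned indexes) pair."""
    lines = []
    for unclean_idx, clean_idxs in play_map.items():
        lines.append(f"({unclean_idx}, {clean_idxs})")
        # Keys of the map are row labels, so look them up by label
        description = df_plays["PlayDescription"].loc[unclean_idx]
        lines.extend(splitter(description))
        lines.append("")
    return lines

# Toy data: descriptions copied from the outputs in this notebook, row labels invented
toy_plays = pd.DataFrame({"PlayDescription": [
    "J.Hurts sacked at PHI 43 for -7 yards (D.Hunter).",
    "J.Hurts scrambles right end pushed ob at PHI 37 for 7 yards (H.Phillips). "
    "MIN-J.Metellus was injured during the play.",
]})
toy_map = {0: [0], 1: [1]}

breakdown = format_play_breakdown(toy_map, toy_plays, lambda text: text.split(". "))
print("\n".join(breakdown))
```

Passing the splitter in makes it easy to switch every category between `split(". ")` and `split_play_description` in one place.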


### RUN PLAYS

In [136]:
# All rushing plays
df_unclean_run_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Run'))]

map_unclean_clean_run_plays = unclean_to_clean_play_matches(df_unclean_run_plays, df_week2_plays_cleaned)

len(map_unclean_clean_run_plays.keys())

803

In [137]:
# Every unclean run play and their associated cleaned play breakdown

for i in map_unclean_clean_run_plays.keys():
  print(f"({i}, {map_unclean_clean_run_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(3, [3])
— J.Hurts scrambles right end pushed ob at PHI 37 for 7 yards (H.Phillips)
MIN-J.Metellus was injured during the play.

(7, [7])
— D.Swift right guard to MIN 8 for 1 yard (H.Phillips; D.Wonnum).

(8, [8])
— J.Hurts right tackle to MIN 6 for 2 yards (J.Hicks, C.Bynum).

(19, [20])
— J.Hurts scrambles right tackle to PHI 14 for 3 yards (D.Wonnum, J.Hicks).

(21, [23])
— D.Swift left guard to PHI 34 for 7 yards (H.Smith).

(22, [24])
— D.Swift right guard to PHI 38 for 4 yards (C.Bynum, I.Pace).

(23, [25])
— D.Swift right end to 50 for 12 yards (J.Hicks).

(28, [31])
— J.Hurts left end to MIN 39 for no gain (J.Hicks).

(30, [33])
— J.Hurts left tackle to MIN 37 for -3 yards (J.Hicks, D.Wonnum).

(36, [39])
— A.Mattison right guard to PHI 44 for -2 yards (J.Carter).

(39, [42])
— A.Mattison left end to PHI 12 for 5 yards (A.Maddox)
PHI-A.Maddox was injured during the play
He is Out.

(44, [47])
— D.Swift right guard to PHI 32 for 7 yards (I.Pace).

(45, [48])
— D.Swift right guar

### 2PT CONVERSION

In [138]:
df_2023_2pt_conversion_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Conversion')]

# All 2PT conversion attempts
df_unclean_conversion_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Conversion'))]

map_unclean_clean_conversion_plays = unclean_to_clean_play_matches(df_unclean_conversion_plays, df_week2_plays_cleaned)

In [139]:
# Every unclean conversion play and their associated cleaned play breakdown

for i in map_unclean_clean_conversion_plays.keys():
  print(f"({i}, {map_unclean_clean_conversion_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(1397, [1440])
— 79-T.Pipkins reported in as eligible
TWO-POINT CONVERSION ATTEMPT
J.Herbert pass to T.Pipkins is complete
ATTEMPT SUCCEEDS.

(1616, [1665])
— TWO-POINT CONVERSION ATTEMPT
J.Dobbs rushes right end
ATTEMPT SUCCEEDS.

(1947, [2004])
— TWO-POINT CONVERSION ATTEMPT
T.Pollard rushes right end
ATTEMPT SUCCEEDS.

(2116, [2177])
— TWO-POINT CONVERSION ATTEMPT
B.Robinson rushes up the middle
ATTEMPT FAILS
The Replay Official reviewed the short of the goal line ruling, and the play was REVERSED
TWO-POINT CONVERSION ATTEMPT
B.Robinson rushes up the middle
ATTEMPT SUCCEEDS.

(2227, [2292])
— TWO-POINT CONVERSION ATTEMPT
R.Wilson pass to C.Sutton is incomplete
ATTEMPT FAILS.

(2566, [2644])
— TWO-POINT CONVERSION ATTEMPT
B.Young pass to A.Thielen is complete
ATTEMPT SUCCEEDS.

(2618, [2700])
— TWO-POINT CONVERSION ATTEMPT
J.Ford rushes left guard
ATTEMPT SUCCEEDS.

(2645, [2728])
— TWO-POINT CONVERSION ATTEMPT
N.Harris rushes up the middle
ATTEMPT FAILS.

(2685, [2768])
— TWO-POINT 

## DEFENSIVE PLAYS

### INTERCEPTION

In [140]:
df_2023_interception_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Interception')]

# All interceptions
df_unclean_interception_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Interception'))]

map_unclean_clean_interception_plays = unclean_to_clean_play_matches(df_unclean_interception_plays, df_week2_plays_cleaned)

In [141]:
# Every unclean interception play and their associated cleaned play breakdown

for i in map_unclean_clean_interception_plays.keys():
  print(f"({i}, {map_unclean_clean_interception_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(26, [28, 29])
— J.Hurts pass deep middle intended for D.Smith INTERCEPTED by Th.Jackson at MIN 35
Th.Jackson to MIN 35 for no gain (D.Smith).

(186, [191, 192])
— D.Ridder pass deep middle intended for J.Smith INTERCEPTED by R.Douglas [K.Clark] at GB 42
R.Douglas to GB 40 for -2 yards (K.Pitts).

(364, [375, 376])
— J.Garoppolo pass short middle intended for A.Abdullah INTERCEPTED by T.Bernard (G.Rousseau) [D.Jones] at LV 28
T.Bernard to LV 28 for no gain (A.Abdullah).

(451, [465, 466])
— T.Munford reported in as eligible
J.Garoppolo pass short right intended for J.Jacobs INTERCEPTED by M.Milano at BUF 49
M.Milano to LV 48 for 3 yards (J.Jacobs).

(591, [607, 608])
— J.Burrow pass short middle intended for T.Higgins INTERCEPTED by G.Stone at BAL 2
G.Stone ran ob at BAL 38 for 36 yards (J.Burrow).

(1073, [1104, 1105])
— P.Mahomes pass deep middle intended for Ju.Watson INTERCEPTED by A.Cisco [J.Allen] at JAX 12
A.Cisco to JAX 12 for no gain (Ju.Watson).

(1345, [1384, 1385])
— J.Fiel

### SACK

In [142]:
df_2023_sack_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Sack')]

# All sacks
df_unclean_sack_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Sack'))]

map_unclean_clean_sack_plays = unclean_to_clean_play_matches(df_unclean_sack_plays, df_week2_plays_cleaned)

In [143]:
# Every unclean sacked play and their associated cleaned play breakdown

for i in map_unclean_clean_sack_plays.keys():
  print(f"({i}, {map_unclean_clean_sack_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(2, [2])
— J.Hurts sacked at PHI 30 for -1 yards (sack split by I.Pace and H.Phillips).

(24, [26])
— J.Hurts sacked at PHI 43 for -7 yards (D.Hunter).

(32, [35])
— K.Cousins sacked at MIN 38 for -7 yards (J.Davis).

(130, [133])
— J.Hurts sacked at MIN 37 for -2 yards (D.Hunter).

(131, [134])
— J.Hurts sacked at MIN 49 for -12 yards (D.Hunter).

(189, [195])
— J.Love sacked at GB 28 for -11 yards (K.Elliss).

(287, [295])
— D.Ridder sacked at GB 15 for -2 yards (sack split by R.Gary and K.Clark).

(359, [370])
— J.Allen sacked ob at BUF 30 for 0 yards (R.Spillane).

(458, [473])
— J.Allen sacked at LV 20 for -7 yards (D.Deablo).

(630, [647])
— J.Burrow sacked at CIN 24 for -6 yards (J.Clowney).

(801, [822])
— J.Goff sacked at DET 35 for -5 yards (T.Brown).

(806, [828])
— J.Goff sacked at DET 48 for -2 yards (D.Jones).

(819, [842])
— G.Smith sacked at SEA 3 for -17 yards (A.Anzalone).

(880, [906])
— C.Stroud sacked at HOU 28 for -9 yards (E.Speed).

(941, [969])
— C.Stroud sacke

## SPECIAL TEAMS PLAYS

### PUNTS

In [144]:
df_2023_punt_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Punt')]

# All punts that include a fumble
df_unclean_punt_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Punt')) &
                                                    (df_week2_plays_modified['PlayDescription'].str.contains('FUMBLE'))]

map_unclean_clean_punt_plays = unclean_to_clean_play_matches(df_unclean_punt_plays, df_week2_plays_cleaned)
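
The filter above combines two `str.contains` masks. As a hedged sketch of the same pattern on toy data: `str.contains` treats its pattern as a regular expression by default, and rows with a missing value can yield a missing entry in the mask, so passing `regex=False` for literal text and `na=False` can make the filter more predictable. The frame below is invented apart from the first description, which comes from the punt output in this notebook.

```python
import pandas as pd

toy_plays = pd.DataFrame({
    "PlayOutcome": ["Punt", "Punt", "Pass", None],
    "PlayDescription": [
        "B.Covey to PHI 16 for 8 yards (T.Dye). FUMBLES (T.Dye), recovered by PHI-K.Ringo at PHI 10.",
        "Toy punt description without a loose ball.",
        "Toy pass description. FUMBLES, ball out of bounds.",
        "Toy description with a missing outcome.",
    ],
})

# na=False turns a missing outcome into False instead of a missing mask value
is_punt = toy_plays["PlayOutcome"].str.contains("Punt", na=False)
# regex=False matches the text literally, which matters for patterns containing "."
has_fumble = toy_plays["PlayDescription"].str.contains("FUMBLE", regex=False, na=False)

fumbled_punts = toy_plays.loc[is_punt & has_fumble]
print(fumbled_punts.index.tolist())
```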

In [145]:
# Every unclean punt play and their associated cleaned play breakdown

for i in map_unclean_clean_punt_plays.keys():
  print(f"({i}, {map_unclean_clean_punt_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = split_play_description(play)
  for j in play_split:
    print(j)
  print()

(15, [15, 16])
— R.Wright punts 51 yards to PHI 8, Center-A.DePaola
B.Covey to PHI 16 for 8 yards (T.Dye)
FUMBLES (T.Dye), recovered by PHI-K.Ringo at PHI 10.

(1227, [1263, 1264])
— T.Gill punts 47 yards to TB 17, Center-P.Scales
D.Thompkins to TB 26 for 9 yards (N.Sewell; D.Cole)
FUMBLES (N.Sewell), ball out of bounds at TB 23
Penalty on TB-K.Britt, Defensive Offside, declined.



### KICKOFFS

In [146]:
df_2023_kickoff_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Kickoff')]

# All kickoffs that include a fumble
df_unclean_kickoff_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Kickoff')) &
                                                       (df_week2_plays_modified['PlayDescription'].str.contains('FUMBLES'))]

map_unclean_clean_kickoff_plays = unclean_to_clean_play_matches(df_unclean_kickoff_plays, df_week2_plays_cleaned)

In [147]:
# Every unclean kickoff play and their associated cleaned play breakdown

for i in map_unclean_clean_kickoff_plays.keys():
  print(f"({i}, {map_unclean_clean_kickoff_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = split_play_description(play)
  for j in play_split:
    print(j)
  print()

## SCORING PLAYS

### TOUCHDOWNS

In [148]:
df_2023_touchdown_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Touchdown')]

# All touchdowns
df_unclean_touchdown_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Touchdown'))]

map_unclean_clean_touchdown_plays = unclean_to_clean_play_matches(df_unclean_touchdown_plays, df_week2_plays_cleaned)

In [149]:
# Every unclean touchdown play and their associated cleaned play breakdown

for i in map_unclean_clean_touchdown_plays.keys():
  print(f"({i}, {map_unclean_clean_touchdown_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = split_play_description(play)
  for j in play_split:
    print(j)
  print()

(41, [44])
— K.Cousins pass short left to T.Hockenson for 5 yards, TOUCHDOWN.

(60, [63])
— J.Hurts up the middle for 1 yard, TOUCHDOWN.

(86, [89])
— J.Hurts up the middle for 1 yard, TOUCHDOWN.

(95, [98])
— J.Hurts pass deep left to D.Smith for 63 yards, TOUCHDOWN.

(102, [105])
— K.Cousins pass deep middle to J.Addison for 62 yards, TOUCHDOWN.

(141, [144])
— K.Cousins pass short left to K.Osborn for 10 yards, TOUCHDOWN [J.Sweat].

(152, [155])
— D.Swift left tackle for 2 yards, TOUCHDOWN.

(166, [169])
— K.Cousins pass short right to T.Hockenson for 5 yards, TOUCHDOWN.

(220, [227])
— J.Love pass short middle to J.Reed for 9 yards, TOUCHDOWN.

(257, [265])
— D.Ridder pass short left to D.London for 3 yards, TOUCHDOWN [Q.Walker].

(273, [281])
— J.Love pass short middle to D.Wicks for 32 yards, TOUCHDOWN.

(295, [304])
— R.Walker reported in as eligible
J.Love pass short right to J.Reed for 10 yards, TOUCHDOWN.

(307, [316])
— D.Ridder left end for 6 yards, TOUCHDOWN.

(354, [365])

### FIELD GOALS

In [150]:
df_2023_field_goal_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Field Goal')]

# All blocked field goals
df_unclean_field_goal_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Field Goal')) &
                                                          (df_week2_plays_modified['PlayDescription'].str.contains('BLOCKED'))]

map_unclean_clean_field_goal_plays = unclean_to_clean_play_matches(df_unclean_field_goal_plays, df_week2_plays_cleaned)

In [151]:
# Every unclean field goal play and their associated cleaned play breakdown

for i in map_unclean_clean_field_goal_plays.keys():
  print(f"({i}, {map_unclean_clean_field_goal_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = split_play_description(play)
  for j in play_split:
    print(j)
  print()

(1223, [1259])
— C.McLaughlin 40 yard field goal is BLOCKED (R.Green), Center-Z.Triner, Holder-J.Camarda, recovered by TB-J.Camarda at 50.

(2333, [2400, 2401])
— J.Sanders 49 yard field goal is BLOCKED (B.Schooler), Center-B.Ferguson, Holder-J.Bailey, RECOVERED by NE-K.Dugger at NE 47
K.Dugger to MIA 49 for 4 yards (C.Wilkins).



### EXTRA POINTS

In [152]:
df_2023_extra_point_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Extra Point')]

# All extra points
df_unclean_extra_point_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Extra Point'))]

map_unclean_clean_extra_point_plays = unclean_to_clean_play_matches(df_unclean_extra_point_plays, df_week2_plays_cleaned)

In [153]:
# Every unclean extra point play and its associated cleaned play breakdown

for i in map_unclean_clean_extra_point_plays.keys():
  print(f"({i}, {map_unclean_clean_extra_point_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = split_play_description(play)
  for j in play_split:
    print(j)
  print()

(42, [45])
— G.Joseph extra point is GOOD, Center-A.DePaola, Holder-R.Wright.

(61, [64])
— J.Elliott extra point is GOOD, Center-R.Lovato, Holder-A.Siposs.

(87, [90])
— J.Elliott extra point is GOOD, Center-R.Lovato, Holder-A.Siposs.

(96, [99])
— J.Elliott extra point is GOOD, Center-R.Lovato, Holder-A.Siposs.

(103, [106])
— G.Joseph extra point is GOOD, Center-A.DePaola, Holder-R.Wright.

(142, [145])
— G.Joseph extra point is GOOD, Center-A.DePaola, Holder-R.Wright.

(153, [156])
— J.Elliott extra point is GOOD, Center-R.Lovato, Holder-A.Siposs.

(167, [170])
— G.Joseph extra point is GOOD, Center-A.DePaola, Holder-R.Wright.

(221, [228])
— A.Carlson extra point is GOOD, Center-M.Orzech, Holder-D.Whelan.

(258, [266])
— Y.Koo extra point is No Good, Wide Left, Center-L.McCullough, Holder-B.Pinion.

(274, [282])
— A.Carlson extra point is GOOD, Center-M.Orzech, Holder-D.Whelan.

(296, [305])
— A.Carlson extra point is GOOD, Center-M.Orzech, Holder-D.Whelan.

(308, [317])
— Y.Koo e

## OTHER

### LATERALS

In [154]:
# All lateral plays
df_unclean_lateral_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayDescription'].str.contains('lateral', case=False))]

map_unclean_clean_lateral_plays = unclean_to_clean_play_matches(df_unclean_lateral_plays, df_week2_plays_cleaned)

In [155]:
# Every unclean lateral and their associated cleaned play breakdown

for i in map_unclean_clean_lateral_plays.keys():
  print(f"({i}, {map_unclean_clean_lateral_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = split_play_description(play)
  for j in play_split:
    print(j)
  print()

(2394, [2466])
— M.Jones pass short left to M.Gesicki to MIA 34 for -1 yards [B.Chubb]
Lateral to C.Strange to MIA 29 for 5 yards (J.Holland; A.Van Ginkel)
The Replay Official reviewed the first down ruling, and the play was REVERSED
(Shotgun) M.Jones pass short left to M.Gesicki to MIA 34 for -1 yards [B.Chubb]
Lateral to C.Strange to MIA 30 for 4 yards (J.Holland; A.Van Ginkel).
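
Reversed plays like the one above contain the original ruling, the review sentence, and then the corrected ruling. A minimal sketch of separating the two halves is shown here, assuming the review sentence always contains the word `REVERSED` and that it belongs with the overturned part; the function name `split_on_reversal` is hypothetical and this is not necessarily how `ReverseDetails` is filled in this notebook.

```python
def split_on_reversal(sentences):
    """Return (sentences up to and including the review, sentences after it)."""
    for position, sentence in enumerate(sentences):
        if "REVERSED" in sentence:
            return sentences[:position + 1], sentences[position + 1:]
    # No review found: nothing was overturned, everything is the final ruling
    return [], sentences

# Sentences copied from the lateral play output above (leading dash dropped)
lateral_sentences = [
    "M.Jones pass short left to M.Gesicki to MIA 34 for -1 yards [B.Chubb]",
    "Lateral to C.Strange to MIA 29 for 5 yards (J.Holland; A.Van Ginkel)",
    "The Replay Official reviewed the first down ruling, and the play was REVERSED",
    "(Shotgun) M.Jones pass short left to M.Gesicki to MIA 34 for -1 yards [B.Chubb]",
    "Lateral to C.Strange to MIA 30 for 4 yards (J.Holland; A.Van Ginkel).",
]

overturned, final_ruling = split_on_reversal(lateral_sentences)
for sentence in final_ruling:
    print(sentence)
```

Only the sentences after the review describe the play as it stands, so the cleaned yardage and end spot would come from that second half.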



### FUMBLES

In [156]:
# All fumble plays
df_unclean_fumble_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('fumble', case=False))]

map_unclean_clean_fumble_plays = unclean_to_clean_play_matches(df_unclean_fumble_plays, df_week2_plays_cleaned)

In [157]:
# Every unclean fumble and their associated cleaned play breakdown

for i in map_unclean_clean_fumble_plays.keys():
  print(f"({i}, {map_unclean_clean_fumble_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = split_play_description(play)
  for j in play_split:
    print(j)
  print()

(20, [21, 22])
— A.Siposs punts 40 yards to MIN 46, Center-R.Lovato
B.Powell to PHI 34 for 20 yards (J.Evans)
FUMBLES (J.Evans), RECOVERED by PHI-N.Morrow at PHI 27.

(27, [30])
— A.Mattison left tackle to MIN 37 for 2 yards (A.Maddox; J.Jobe)
FUMBLES (A.Maddox), RECOVERED by PHI-J.Evans at MIN 39
J.Evans to MIN 39 for no gain (A.Mattison).

(71, [74])
— K.Cousins pass deep left to J.Jefferson to PHI 1 for 30 yards (T.Edmunds)
FUMBLES (T.Edmunds), ball out of bounds at PHI 1
The Replay Official reviewed the ball was out of bounds ruling, and the play was REVERSED
(Shotgun) K.Cousins pass deep left to J.Jefferson to PHI 1 for 30 yards (T.Edmunds)
FUMBLES (T.Edmunds), ball out of bounds in End Zone, Touchback.

(84, [87])
— K.Cousins sacked at MIN 18 for -8 yards (J.Sweat)
FUMBLES (J.Sweat) [J.Sweat], RECOVERED by PHI-F.Cox at MIN 15
F.Cox to MIN 7 for 8 yards (E.Ingram).

(491, [506])
— Z.White left tackle to BUF 15 for no gain (D.Jackson)
FUMBLES (D.Jackson), RECOVERED by BUF-T.Rapp at

### TURNOVER ON DOWNS

In [158]:
# All turnover on downs plays
df_unclean_turnover_on_downs_plays = df_week2_plays_modified.loc[(df_week2_plays_modified['PlayOutcome'].str.contains('Turnover on Downs'))]

map_unclean_clean_turnover_on_downs_plays = unclean_to_clean_play_matches(df_unclean_turnover_on_downs_plays, df_week2_plays_cleaned)

In [159]:
# Every unclean turnover on downs play and its associated cleaned play breakdown

for i in map_unclean_clean_turnover_on_downs_plays.keys():
  print(f"({i}, {map_unclean_clean_turnover_on_downs_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = split_play_description(play)
  for j in play_split:
    print(j)
  print()

(1086, [1118])
— T.Lawrence sacked at 50 for -5 yards (C.Jones).

(1361, [1402])
— R.Tannehill sacked at TEN 48 for -6 yards (K.Murray).

(1854, [1910])
— K.Williams left tackle to LAR 38 for -1 yards (N.Bosa).

(2394, [2466])
— M.Jones pass short left to M.Gesicki to MIA 34 for -1 yards [B.Chubb]
Lateral to C.Strange to MIA 29 for 5 yards (J.Holland; A.Van Ginkel)
The Replay Official reviewed the first down ruling, and the play was REVERSED
(Shotgun) M.Jones pass short left to M.Gesicki to MIA 34 for -1 yards [B.Chubb]
Lateral to C.Strange to MIA 30 for 4 yards (J.Holland; A.Van Ginkel).



## INDEX SEARCHING

In [170]:
df_week2_plays_cleaned.iloc[963]

Unnamed: 0,963
Season,2023
Week,Week 2
Day,SUN
Date,09/17
AwayTeam,IND
HomeTeam,HOU
Quarter,3
DriveNumber,1
TeamWithPossession,HOU
IsScoringDrive,0
