<a href="https://colab.research.google.com/github/KeoniM/NFL_Data_Cleaning/blob/main/NFL_Plays_Week1_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PURPOSE:**
- Accurately clean a week's worth of play data
  - Season 2023 -> Week 1

**CONCERNS FOR LATER:**

*General*

- Players with the same name
- Cleaning check
  - Figure out some type of method that will help discern whether these plays have been cleaned correctly.
    - Cross-reference recorded NFL stats with the stats here and compare them.
- FUMBLE RECOVERIES
  - Change the 'PlayType' and 'PlayOutcome' of fumble recoveries
  - SITUATION: Running back fumbles on a run play but recovers it and rushes for x yards.
    - This would still count towards his rushing yards.
    - 'PlayType' = 'Run'
    - 'PlayOutcome' = 'X Yard Run'
      - 2 rows will be present for this type of play. 1 before fumble and 1 after fumble. Each will have their own separate 'PlayOutcome'..?
  - SITUATION: Any fumble recovery that is not by the running back on an intended running play
    - This would not count as rushing yards for the player who recovered the fumble.
    - 'PlayType' = 'Fumble Return'
    - 'PlayOutcome' = 'X Yard Fumble Return'
- Fix playoutcomes and playtypes for plays that have been split up into multiple rows.
  - For example, If a team throws an interception and that interception results in a touchdown for the opposing team, I do not think it should be considered as a 'scoring drive' for the team that threw the interception.
- Should I broaden 'playtypes' to include:
  1. yardage after fumble (Currently have it as 'Run' playtype)
  2. yardage after interception (Currently have it as 'Interception')
- Condense features.
- Condense regular expressions to grab multiple pieces of wanted data instead of individual.
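
For the last point, one option is a single pattern with named groups that captures several fields in one pass. The following is a minimal sketch, assuming pass descriptions follow the "(clock) (formation) passer pass depth side to receiver ... for N yards" shape seen in the sample rows. `pass_play_pattern` is a hypothetical name, and the player names in the example description are made up.

```python
import re

# Same name shape as the notebook's name_pattern: optional hyphenated parts around "X.Surname"
NAME = r"(?:[A-Za-z]+-)*[A-Za-z]+\.[A-Za-z]+(?:-[A-Za-z]+)*"

# One pattern, several named fields (hypothetical combined pattern, not the notebook's own)
pass_play_pattern = re.compile(
    rf"\((?P<clock>\d*:\d+)\)\s+"                   # time on the clock
    rf"(?:\((?P<formation>[A-Za-z ,]+)\)\s+)?"      # optional formation
    rf"(?P<passer>{NAME}) pass "                    # passer
    rf"(?P<depth>short|deep) (?P<side>left|right|middle) "
    rf"(?:to|intended for) (?P<receiver>{NAME})"    # receiver
    rf"(?:.*? for (?P<yards>-?\d+) yards?)?"        # optional yardage
)

# Made-up description in the same style as the sample rows
description = (
    "(14:34) (No Huddle, Shotgun) A.Passer pass short right "
    "to B.Receiver to BUF 37 for 5 yards (C.Tackler)."
)

match = pass_play_pattern.search(description)
fields = match.groupdict() if match else {}
print(fields)
```

Each named group maps directly to a column name, so a single `groupdict()` call could replace several separate `re.findall` calls per play.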

*Offense*

- Trick plays
- Laterals

*Defense*

- Nuance of players recorded for sacks & forced fumbles
  - Look under sack play type cleaning method
    - The formatting of multiple defending players on a fumbled play may cause data to be recorded incorrectly (e.g. the player who assisted in the tackle may be credited with the forced fumble)

- DEFENSIVE STATS ARE CURRENTLY WRONG
  - I will work on 0.5 tackles, solo tackles and assists. I need to adjust cleaning methods to collect this data better.
    - ';' means solo and assist tackle
    - ',' means 0.5 tackle
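
If the separator convention above holds, the tackle credits could be read from the parenthesized defender group in one step. This is a minimal sketch under that assumption; `tackle_credits` and the credit labels are hypothetical, and the player names are made up.

```python
import re

# Same name shape as the notebook's name_pattern
NAME = r"(?:[A-Za-z]+-)*[A-Za-z]+\.[A-Za-z]+(?:-[A-Za-z]+)*"

# "(A.One)", "(A.One; B.Two)" or "(A.One, B.Two)"
TACKLE_GROUP = re.compile(rf"\(({NAME})(?:([;,]) ({NAME}))?\)")

def tackle_credits(description):
    """Return a list of (player, credit) pairs for the first defender group found."""
    match = TACKLE_GROUP.search(description)
    if match is None:
        return []
    first, separator, second = match.groups()
    if second is None:
        return [(first, "solo")]
    if separator == ";":
        # Assumed meaning: first defender gets the solo tackle, second an assist
        return [(first, "solo"), (second, "assist")]
    # Assumed meaning: a comma splits the tackle evenly
    return [(first, "half"), (second, "half")]

print(tackle_credits("(13:24) (Shotgun) A.Runner up the middle to BUF 42 for 2 yards (B.One; C.Two)."))
print(tackle_credits("(12:50) A.Runner left end to BUF 45 for 3 yards (B.One, C.Two)."))
```

The sketch only looks at the first parenthesized name group, so a sentence such as "FUMBLES (B.Burns)" would be misread as a tackle; fumbled plays would need to be handled before this step.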

LATER IDEAS:
- Use 'Fuzzywuzzy' to find like play outcomes.
  - This will give me a chance to automate play types instead of eyeballing them and separating them manually.
- Map team name with their abbreviations ( e.g. "Cowboys" <-> "DAL" )
  - Maybe with larger datasets with multiple weeks, I can map team names with the team abbreviations that match up the most.
- For the category "isScoringDrive" the categories could be:
  1. 0 - Is not a scoring drive
  2. 1 - Is scoring drive for team on offense
  3. 2 - Is scoring drive for team on defense
- Shorten cleaning methods by creating a helper method to grab data from the defense on a play
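
The name-to-abbreviation idea can be sketched as a co-occurrence count: for every game, record which possession abbreviations appear, then give each team name the abbreviation it shares the most games with. This is a minimal sketch, assuming rows of `(week, away_name, home_name, possession_abbr)`. `map_team_abbreviations` is a hypothetical helper, and the toy rows are illustrative matchups, not real schedule data.

```python
from collections import Counter, defaultdict

def map_team_abbreviations(rows):
    """rows: iterable of (week, away_name, home_name, possession_abbr), one per play."""
    # Abbreviations seen in each game
    abbrs_per_game = defaultdict(set)
    for week, away, home, abbr in rows:
        abbrs_per_game[(week, away, home)].add(abbr)

    # For each team name, count the games in which each abbreviation appeared
    counts = defaultdict(Counter)
    for (_, away, home), abbrs in abbrs_per_game.items():
        counts[away].update(abbrs)
        counts[home].update(abbrs)

    # The abbreviation that shows up in the most of a team's games is taken as its own
    return {name: counter.most_common(1)[0][0] for name, counter in counts.items()}

# Toy rows (illustrative only)
toy_rows = [
    ("Week 1", "Bills", "Jets", "BUF"),
    ("Week 1", "Bills", "Jets", "NYJ"),
    ("Week 2", "Jets", "Raiders", "NYJ"),
    ("Week 2", "Jets", "Raiders", "LV"),
    ("Week 3", "Raiders", "Bills", "LV"),
    ("Week 3", "Raiders", "Bills", "BUF"),
]
mapping = map_team_abbreviations(toy_rows)
print(mapping)
```

With a single week every team appears in only one game, so both abbreviations in that game tie; this is why the approach needs multiple weeks. For the notebook's DataFrame, the rows could come from `week1_2023_plays[['Week', 'AwayTeam', 'HomeTeam', 'TeamWithPossession']].itertuples(index=False)`.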

# MOUNTING AND IMPORTS

In [None]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Used to access personal google cloud services
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
# Imports

# Data manipulation
import pandas as pd

# Regular expressions
import re

# Grab data from database
from google.cloud import bigquery

In [None]:
# # debugger (maybe use in the future)
# %pdb on

# LOADING DATA (BigQuery queries)

In [None]:
# Client connect to bigquery project
client = bigquery.Client('nfl-data-430702')

## Season 2023 Week 1

In [None]:
# Grabbing all plays from 2023 Week 1 NFL Season
week1_2023_plays_query = """
                         SELECT *
                         FROM `nfl-data-430702.NFL_Scores.NFL-Plays-Week1_2023`
                         """

# Dry run: validates the query and reports how many bytes it would process, without actually running it
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query = client.query(week1_2023_plays_query, job_config=dry_run_config)
print("This query will process {} bytes.".format(dry_run_query.total_bytes_processed))

# Running query (being mindful of the amount of data being grabbed)
# - Caps the bytes billed at 1 GB (10**9 bytes); a query over the cap fails instead of running
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
safe_config_query = client.query(week1_2023_plays_query, job_config=safe_config)

This query will process 570194 bytes.


In [None]:
# Putting data attained from query into a dataframe
week1_2023_plays = safe_config_query.to_dataframe()

In [None]:
week1_2023_plays.head()

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,PlayNumberInDrive,IsScoringPlay,PlayOutcome,PlayDescription,PlayStart
0,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,1,0,Kickoff,G.Zuerlein kicks 65 yards from NYJ 35 to end z...,Kickoff from NYJ 35
1,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,2,0,7 Yard Pass,(15:00) (Shotgun) J.Allen pass short right to ...,1st & 10 at BUF 25
2,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,3,0,5 Yard Pass,"(14:34) (No Huddle, Shotgun) J.Allen pass shor...",2nd & 3 at BUF 32
3,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,4,0,3 Yard Run,(14:01) J.Cook up the middle to BUF 40 for 3 y...,1st & 10 at BUF 37
4,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,5,0,2 Yard Run,(13:24) (Shotgun) J.Cook up the middle to BUF ...,2nd & 7 at BUF 40


In [None]:
# Noting the original size of the raw uncleaned dataframe of data
# - (rows, columns)
week1_2023_plays.shape

(2600, 15)

# CATEGORIZE PLAYS
- The goal here is to parse out the different values for 'PlayOutcome'
  - This is where I will separate different types of plays
    - ( pass / run / kickoff / etc. )

In [None]:
# Maybe try to fuzzywuzzy this in the future?
# - I need to narrow these down into basic categories.

# All play outcomes from the game
# - From here we can categorize and clean plays accordingly
week1_2023_plays['PlayOutcome'].unique()

array(['Kickoff', '7 Yard Pass', '5 Yard Pass', '3 Yard Run',
       '2 Yard Run', 'Pass Incomplete', 'Punt', '-5 Yard Penalty',
       '5 Yard Run', '1 Yard Pass', '14 Yard Run', '3 Yard Pass',
       '8 Yard Run', '6 Yard Pass', '15 Yard Pass', '-9 Yard Sack',
       '4 Yard Pass', '13 Yard Pass', 'Field Goal', '-2 Yard Sack',
       'Interception', '-5 Yard Run', '18 Yard Pass', '8 Yard Pass',
       '6 Yard Run', '12 Yard Run', '-1 Yard Run', '26 Yard Pass',
       'Touchdown Bills', 'Extra Point Good', '13 Yard Run',
       '-3 Yard Sack', '7 Yard Run', '9 Yard Pass', '4 Yard Run',
       'Fumble', '-10 Yard Penalty', '10 Yard Pass', '26 Yard Run',
       '5 Yard Penalty', '-10 Yard Sack', '22 Yard Pass', '-4 Yard Run',
       '-12 Yard Sack', '83 Yard Run', '1 Yard Run', '2 Yard Pass',
       '10 Yard Run', 'Run for No Gain', '12 Yard Pass', '20 Yard Pass',
       '9 Yard Run', '-2 Yard Pass', 'Sack', '24 Yard Pass',
       '14 Yard Pass', 'Touchdown Jets', '-3 Yard Run', '-2 Yar

In [None]:
# NOTES:
# - There are more play types that I have not made yet for Week 1.
# - Currently, I am eyeballing all unique play outcomes to categorize them.
#   - This type of approach is not flexible because a play outcome can
#     arise that has not been seen yet.
#     - There may be more play outcomes in the future when working on a full season,
#       let alone all seasons and future games

# Play Types with complete cleaning methods
df_2023_pass_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Pass')]
df_2023_run_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Run')]
df_2023_interception_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Interception')]
df_2023_sack_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Sack')]
df_2023_punt_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Punt')]

# Play Types currently working on
df_2023_kickoff_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Kickoff')]
df_2023_touchdown_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Touchdown')]


# Play types need to work on
# df_2023_fumble_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Fumble')] <--- NOTES:
# - Maybe for this one I can separate them into different categories of playtypes and have them all cleaned at once.
#   e.g. group all run playtypes and have them cleaned. Then group all passing play types and have them all cleaned.
# df_2023_penalty_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Penalty')]
# df_2023_fieldgoal_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Field Goal')]
# df_2023_extrapoint_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Extra Point')]

# plays_list = [df_2023_pass_sb,
#               df_2023_run_sb,
#               df_2023_punt_sb,
#               df_2023_sack_sb,
#               df_2023_kickoff_sb,
#               df_2023_fumble_sb,
#               df_2023_interception_sb,
#               df_2023_penalty_sb,
#               df_2023_fieldgoal_sb,
#               df_2023_touchdown_sb,
#               df_2023_extrapoint_sb]

## SANITY CHECK (All Plays Accounted for)
- NOT COMPLETE
  - Still need to grab other play types
    - Once all plays have been categorized, will compare the sum to the size of the original dataframe of plays
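
Comparing only the summed sizes can hide a problem: a play counted twice can cancel out a play that was missed. A stricter check is that every outcome matches exactly one category. This is a minimal sketch, assuming the substring matching used for the slices in this notebook; `check_partition` is a hypothetical helper.

```python
def check_partition(outcomes, keywords):
    """Return (unmatched, overlapping) outcomes for a list of category keywords."""
    unmatched, overlapping = [], []
    for position, outcome in enumerate(outcomes):
        hits = [keyword for keyword in keywords if keyword in outcome]
        if not hits:
            unmatched.append((position, outcome))          # no category claims this play
        elif len(hits) > 1:
            overlapping.append((position, outcome, hits))  # claimed by several categories
    return unmatched, overlapping

# Keywords for the slices built so far in this notebook
keywords = ["Pass", "Run", "Interception", "Sack", "Punt", "Kickoff", "Touchdown"]
outcomes = ["Kickoff", "7 Yard Pass", "Fumble", "Pass Incomplete"]

unmatched, overlapping = check_partition(outcomes, keywords)
print("unmatched:", unmatched)
print("overlapping:", overlapping)
```

All plays are accounted for when both lists are empty; in that case the slice sizes necessarily add up to the size of the original dataframe.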

In [None]:
# Empty for now.

# HELPER METHODS (personal use)
- For personal use only; these methods take no part in cleaning the dataset.

In [None]:
# PURPOSE:
# - Quick look at a section of plays
#   - Ideally the plays that the user wants to break down and clean.
# INPUT PARAMETERS:
# df_all_plays      - DataFrame - The original dataframe where the desired plays to view came from
# df_section_plays  - DataFrame - A section of the original dataframe the user wants to view
# RETURN:
# - Printing to the console:
#   1. index of play
#   2. 'PlayDescription' feature of play
#   3. 'PlayOutcome' feature of play
def print_plays(df_all_plays, df_section_plays):
  for idx, value in df_section_plays['PlayOutcome'].items():
    play = df_all_plays['PlayDescription'].loc[idx]  # idx is an index label, so look it up with .loc
    print("index:" + str(idx))
    for i in play.split(". "):
      print(i)
    print(value)
    print()

# PIPELINE
  - ORDER
    1. Regular expressions
      - Used to find common patterns within raw data
    2. Cleaning methods
      - Unique cleaning methods for each play type
        - Some methods may include helper methods
    3. Main pipeline method
      - Control flow of cleaning methods
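
The control flow of the last stage can be sketched as a chain in which each cleaning method takes the plays and returns them for the next method. This is a minimal sketch, assuming every cleaning method returns its (possibly longer) collection of plays. `run_pipeline` and the two stand-in cleaners are hypothetical and operate on plain dictionaries rather than a DataFrame.

```python
def run_pipeline(plays, cleaning_methods):
    """Apply each cleaning method in order, feeding its result to the next one."""
    for clean in cleaning_methods:
        plays = clean(plays)
    return plays

# Stand-in cleaners (hypothetical): each tags only the plays it is responsible for
def tag_pass_plays(plays):
    return [{**p, "PlayType": "Pass"} if "Pass" in p["PlayOutcome"] else p for p in plays]

def tag_run_plays(plays):
    return [{**p, "PlayType": "Run"} if "Run" in p["PlayOutcome"] else p for p in plays]

plays = [
    {"PlayOutcome": "7 Yard Pass"},
    {"PlayOutcome": "3 Yard Run"},
    {"PlayOutcome": "Punt"},
]
cleaned = run_pipeline(plays, [tag_pass_plays, tag_run_plays])
print(cleaned)
```

In this notebook the list would hold the real cleaning methods, such as `clean_pass_plays` and `clean_run_plays`. Because a fumbled play can be replaced by several rows, each method has to return the dataframe rather than rely on in-place changes.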



## 1. REGULAR EXPRESSIONS

In [100]:
####################################################
# REGULAR EXPRESSIONS USED TO LOCATE SPECIFIC DATA #
####################################################

###########
# GENERAL #
###########

# Player's name (matches every variation encountered so far)
name_pattern = r"(?:[A-Za-z]+-)*[A-Za-z]+\.[A-Za-z]+(?:-[A-Za-z]+)*"

################
# PLAY DETAILS #
################

# Play start time
time_on_clock_pattern = r'\((\d*:\d+)\)'

# Offense play formation
formation = r'\(([A-Za-z]+ ?[A-Za-z]*,? ?[A-Za-z]*)\)'

# Yards gained on play
yardage_gained = r'for (-?[0-9]+) yards?'

###########
# OFFENSE #
###########

# Passer (Player passing, Player spiking, Player who got sacked)
passer_name_pattern = f"({name_pattern}) (?:pass|spiked|sacked)"

# Rushing play (Player running ball)
rusher_pattern = f"({name_pattern})(?: scrambles)? (?:left|right|up|kneels).?"

# Pass play (Returns intended receiver and the direction of the pass)
receiver_pattern = f"(short|deep) (left|right|middle) (?:to|intended for) ({name_pattern})"

###########
# DEFENSE #
###########

# Tackles (solo, assist, shared) <-- the goal. Right now all I have is tackle1 and tackle2

# Main defender on play (Used to grab tackler1 and used to grab players that sacked the passer)
defense_tackler_1_name_pattern = rf"\(({name_pattern})"

# Second defender on play (Used to grab tackler2)
defense_tackler_2_name_pattern = rf" ({name_pattern})\)" # Will have a ")" at the end of the name

# Pressure (Who applied pressure to passer)
# - I think it might be possible for multiple defenders to apply pressure to the passer.
defense_pressure_name_pattern = rf"\[({name_pattern})\]"

# Interception (Player who intercepted pass)
interception_name_pattern = f"INTERCEPTED by ({name_pattern})"

# Quarterback Fumbles (Quarterback fumble solo, Quarterback fumble solo -> who recovers, Quarterback <-> Center discrepancy)

# How far passer went before fumbling on his own
qb_fumble_pattern = f" ({name_pattern}) to(?: [A-Z]+) [0-9]+ for -?[0-9]+ yards$" # Passer fumbles are always the initial action of the play

# Action directly after a quarterback only fumble
qb_fumble_description_pattern = f"^FUMBLES, "

# Fumbled snap (Will either be the quarterback or center.)
aborted_fumble_pattern = f"({name_pattern}) FUMBLES"

# Forced fumbles (Player who forced the fumble)
forced_fumble_pattern = rf"FUMBLES \(({name_pattern})\)"

# Sack (Who is credited with a sack, who split sack, how many yards was the sack)

# Fumble from sack (Player who forced the fumble on a sack)
sacked_forced_fumble_sentence = rf"FUMBLES \({name_pattern}\) \[({name_pattern})\]"

# Split sack (Players who equally received credit for sack)
split_sack_pattern = f"sack split by ({name_pattern}) and ({name_pattern})"

# Yardage of sack (starting from line of scrimmage)
yardage_from_sack = r'sacked(?: ob)? at(?: [A-Z]+)? [0-9]+ for (-?[0-9]+) yards'

# Defense takeaway (takeaway for yardage)
defensive_takeaway_run_pattern = f"^({name_pattern}) (?:pushed ob at|ran ob at|to)(?: [A-Z]+) -?[0-9]+ for " # yardage after fumble recovery & yardage after interception

# Defense takeaway (takeaway for touchdown)
touchdown_after_takeaway_pattern = f"({name_pattern}) for [0-9]+ yards, TOUCHDOWN" # touchdown after a fumble recovery or interception

#################
# SPECIAL TEAMS #
#################

# Punting play (Who was the punter, How many yards the ball went, Who was the Longsnapper)
punting_pattern = f"({name_pattern}) punts (-?[0-9]+) yards? to(?: [A-Z]+ -?[0-9]+| -?[0-9]+| end zone), Center-({name_pattern})"

# Punt return (Who was returning the punt, How many yards did they go, The player(s) that tackled the returner)
punt_return_pattern = rf"({name_pattern}) (?:pushed ob at|ran ob at|to)(?: [A-Z]+)? [0-9]+ for (-?[0-9]+) yards? \(({name_pattern})(?:(?:,|;) ({name_pattern}))?\)" # yardage after punt

# J.Reed (didn't try to advance) to CHI 44 for no gain.
kick_return_pattern = rf"({name_pattern})(?: \(didn't try to advance\))? (?:pushed ob at|ran ob at|to)(?: [A-Z]+)? [0-9]+ for (no gain|(-?[0-9]+) yards?) \(({name_pattern})(?:(?:,|;) ({name_pattern}))?\)" # yardage after kickoff

# Punt return resulting in fair catch
punt_fair_catch_pattern = f", fair catch by ({name_pattern})"

# Punt or kickoff downed by
kick_downed_by_pattern = f", downed by ({name_pattern})"

# Kickoff play (Who was the kicker, How many yards the ball was kicked )
kickoff_pattern = f"({name_pattern}) kicks(?: onside)? (-?[0-9]+) yards from"

##############
#  INJURIES  #
##############

# Injuries (Returns the player(s) who got injured during the play)
injury = f"[A-Z]+-({name_pattern}) was injured during the play"

## 2. CLEANING METHODS

#### Universal helper method for fumbles
Currently retrieves fumble data from these play types:
1. Running fumbles
2. Passing fumbles
3. Sacked fumbles
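
The core of the helper below is a grouping step: the description is split into sentences, a sentence that matches a main-action pattern starts a new row, a fumble description is attached to the most recent row, and everything else is set aside as extra data. This is a simplified sketch of that step only. `group_play_sentences` and the shortened patterns are hypothetical, the description is made up, and unlike the helper it appends add-on sentences and skips them when no row exists yet.

```python
import re

def group_play_sentences(description, main_action_patterns, add_on_patterns):
    """Split a play description into row groups and leftover sentences."""
    groups, extra = [], []
    for sentence in description.split(". "):
        if any(re.search(p, sentence) for p in main_action_patterns):
            groups.append([sentence])      # a main action starts its own row
        elif groups and any(re.search(p, sentence) for p in add_on_patterns):
            groups[-1].append(sentence)    # fumble details belong to the previous action
        else:
            extra.append(sentence)         # injuries, penalties, etc.
    return groups, extra

# Shortened stand-ins for the notebook's patterns (illustrative only)
main_action_patterns = [
    r"[A-Za-z]+\.[A-Za-z]+ (?:left|right|up)",           # intended run
    r"^[A-Za-z]+\.[A-Za-z]+ to [A-Z]+ [0-9]+ for ",      # run after a recovery
]
add_on_patterns = [r"FUMBLES \("]

description = (
    "(10:15) A.Runner up the middle to BUF 40 for 3 yards (B.Hitter). "
    "FUMBLES (B.Hitter), RECOVERED by NYJ-C.Finder at BUF 41. "
    "C.Finder to BUF 35 for 6 yards (D.Chaser). "
    "NYJ-E.Hurt was injured during the play."
)

groups, extra = group_play_sentences(description, main_action_patterns, add_on_patterns)
print(groups)
print(extra)
```

Each inner list in `groups` corresponds to one replacement row, with the optional second element holding the fumble details for that row.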

In [None]:
# PURPOSE:
# - Universal helper method that extracts fumbled data from every playtype.

# BASIC PLAN:
# 1. Accept a single row of a play that has been fumbled from the main dataframe of plays.
# 2. Replace that single row with a dataframe containing all extracted data.
#    - These replacement dataframes are not limited to a single row but can be many, depending on the play.

# BASIC DESIGN STEP BY STEP:
# 1. Split play description into significant actions and put into a list
#    EXAMPLES:
#    - intended play
#    - fumble recovery for yardage
# 2. Clean significant actions as their own rows
#    EXAMPLE METHODS USED TO CLEAN:
#    - main cleaning method (method used to clean a playtype that is using this helper method)
#    - run playtype cleaning method (Will be used to clean all fumble recoveries for yardage)
# 3. Create dataframe containing all cleaned significant actions

# INPUT PARAMETERS:
# df_plays                  - dataframe - dataframe of plays
# play                      -  String   - 'PlayDescription' of the current play that is being cleaned
# play_index                -  Integer  - index of play (Almost always from main dataframe of plays)
# main_action_patterns      -    list   - A list of regular expressions that are meant to pinpoint primary
#                                         actions within a play that will be used to extract these actions
#                                         to create a row within the replacement dataframe
# main_cleaning_method      - function  - A callback function (the function using this helper method) which
#                                         is used to clean intended play actions

# RETURN:
# df_multi_row_play - dataframe - dataframe of organized and cleaned actions stemming from a single unclean fumbled play

# NOTE: I need to comment effectively, grabbing all the nuances of what is being grabbed
#       for each playtype. All playtypes are different and need to be described.

# CONCERNS:
# 1. Nuance on sacked plays
#    - Formatting of the defender who caused the sack differs between solo and assisted sacks
# 2. Who is at fault for aborted plays
#    - Formatting on aborted plays is different if the fault lands on the center or passer

# def extract_fumble_data(df_plays, play, play_index, main_action_patterns, secondary_action_patterns, main_cleaning_method):
def extract_fumble_data(df_plays, play, play_index, main_action_patterns, main_cleaning_method):

  # The reasoning for "df_plays.index.tolist().index(play_index)"
  # - 'df_plays' does not always mean the entire main dataframe of plays,
  #   sometimes it may be a slice of the main dataframe of plays.
  #   - 'play_index' will almost always give the index of the play within the main dataframe of plays.
  #      - 'original_play_copy' will not be the correct play if 'df_plays' is a slice of the main dataframe of plays
  #        AND 'play_index' is from the main dataframe of plays. ('df_plays' could have length 10 and 'play_index' could be 2000)
  original_play_copy = df_plays.iloc[df_plays.index.tolist().index(play_index)]

  # Breaking play description into a list of sentences
  play_elements = play.split(". ")

  #################
  # KEY VARIABLES #
  #################

  # 'play_split' info:
  # - Designed to be a 2D list (list of lists)
  # - All elements within this list together will represent a single play.
  # - Each element within the list will become a separate row that will replace/add to the original dataframe of plays.
  #   - Each element represents a distinct action within the single play and will have all data required for that new row.
  #   ROW CONTENTS:
  #   1. [ ( The intended play ) + ( Extra data ) , ( Who caused the fumble ) ]   <-  This row will have extra info such as (injuries / penalties / eligibility / etc...)
  #                                                                                   - "The intended play" includes 'Aborted' plays
  # ~ 2. [            ( The fumble recovery )     , ( Who caused the fumble ) ]   <-  This can happen repeatedly or not at all
  # ~ 3. [ (The fumble recovery for a touchdown) ]                                <-  This can only happen once for a single play or not at all
  play_split = []

  # 'extra_data' info:
  # - Will be a single string containing all additional data from the play such as (injuries / penalties / eligibility / etc...)
  # - Will be put into a single row dataframe and cleaned
  #   - Once extra data has been cleaned, the single row (now cleaned) dataframe will serve as a shell for
  #     the first new row that will replace the old play within the main dataframe.
  #     - This first new row will have the initial action of the play as well as all additional information from the play
  extra_data = ""

  # - Iterate through each element within play_elements
  # - NOTE: We are iterating through actions of the play chronologically
  for string in play_elements:

    ######################################
    # ORGANIZING KEY ACTIONS WITHIN PLAY #
    ######################################

    # ACTIONS WITHIN PLAY THAT DESERVE THEIR OWN ROW:
    # These situations will have their own list element within "play_split" (meaning their own row within the new cleaned replacement dataframe)
    # 1. intended play (initial action might be a better name for plays such as ones that have been aborted)
    #    RUN PLAYS:
    #     - Fumbles after intended run play
    #     - Aborted fumbles
    #     - qb only fumbles
    #    SACKED PLAYS:
    #     - fumbles after sack
    #    PASSING PLAYS:
    #     - Fumbles after intended pass play
    #     - qb only fumbles
    # 2. runs after fumble recoveries (emphasis on the plural)
    # 3. touchdown after fumble recovery (can only happen once) (looks unique for each playtype) <- this might not be true.
    #    ! ! ! ATTENTION ! ! !
    #    - I have a small sample size for this.
    #    - This is one thing that I need to double check correctness on later in the future when having a larger sample size.
    #    RUNS PLAYS:
    #    - Are fumble recovery touchdown from run plays accounted for?
    #    SACKED PLAYS
    #    - touchdown after a sacked play
    #    PASSING PLAYS:
    #    - Are fumble recovery touchdown from passing plays accounted for?
    #
    #    - Are all fumble recoveries the same? wouldn't they all be rushing playtypes?
    # A sentence matching any main-action pattern starts a new row
    if any(re.search(play_pattern, string) for play_pattern in main_action_patterns):
      play_split.append([string])
      continue

    # ADD ON SECTION (Actions that will add to elements that will obtain their own row)
    # - Appends data to elements within 'play_split'
    #   - Every element within play_split is a list, this section will add to those individual lists
    #     - Specifically it will append to the last element within 'play_split' and the reason for that
    #       is because as we are iterating through sentences chronologically, the appending element
    #       will always follow directly after the element that needs it
    # These situations will add to the last element within 'play_split' (For all playtypes)
    # 1. forced fumble description (happens after regular plays & sometimes after fumble recoveries)
    # 2. fumble description describing a qb only fumble (happens after a qb only fumble)
    # A fumble description is attached to the most recent row
    if any(re.search(play_pattern, string) for play_pattern in [forced_fumble_pattern, qb_fumble_description_pattern]):
      play_split[-1] = [play_split[-1][0], string]
      continue

    # When a sentence does not fit within the top 2 sections ( 1. adding an element to the list || 2. appending to an element in the list )
    # - Glue the sentence into 'extra_data' to be cleaned separately.
    extra_data = extra_data + string + ". "

  ################################
  # CLEANING ACTIONS WITHIN PLAY #
  ################################

  # GRABBING: Initial action of play (e.g. Intended play / aborted fumble / qb only fumble / etc...)
  intended_play_description = play_split.pop(0)

  # Creating a single row dataframe of the original play
  unclean_original_play_copy = pd.DataFrame([original_play_copy.copy()], columns=df_plays.columns)

  # CREATING SHELL FOR: Initial action of play
  # - shell is only necessary with plays that have extra data (injuries / penalties / eligibility / etc...)
  # - extra data will only be available within the first row of the replacement dataframe
  if extra_data:
    unclean_original_play_copy['PlayDescription'] = extra_data
    unclean_original_play_copy = main_cleaning_method(unclean_original_play_copy)

  # CLEANING: Initial action of play
  # No matter what the initial action is, the description will always be the first element of the first element within 'play_split'
  unclean_original_play_copy['PlayDescription'] = intended_play_description[0]



  # May have to adjust in the future.
  # - ON SACKED PLAYS, there is nuance on the formatting of a player who caused a sack and a forced fumble.
  #   - Sometimes it'll look something like this "FUMBLES (B.Burns) [B.Burns]" <- [B.Burns] is credited with the forced fumble
  #   - less often it'll look like "FUMBLES (B.Burns)" <- B.Burns is credited with the forced fumble.
  # - ON ABORTED PLAYS, there is nuance in the formatting of a player who caused the play to be aborted.
  #   - the word "Aborted" will appear either in parentheses or without, which signals whether the center or the passer was at fault.
  #     - Need to figure out how to record this data.

  # intended play / qb only fumble
  if len(intended_play_description) > 1:
    unclean_original_play_copy['FumbleDetails'] = intended_play_description[1]
    forced_fumble = re.findall(forced_fumble_pattern, intended_play_description[1])
    if len(forced_fumble) > 0:
      unclean_original_play_copy['ForcedFumbleBy'] = forced_fumble[0]
    cleaned_original_play_copy = main_cleaning_method(unclean_original_play_copy)
  # Aborted fumble
  else:
    unclean_original_play_copy['FumbleDetails'] = intended_play_description[0]
    cleaned_original_play_copy = unclean_original_play_copy

  # FUMBLE RECOVERIES FOR YARDAGE & FUMBLE RECOVERIES FOR TOUCHDOWNS

  # Created list for the possibility of having multiple fumbles and recoveries in a single play
  list_recovery_runs = []

  for play in play_split:

    recovery_run_row = pd.DataFrame([original_play_copy.copy()], columns=df_plays.columns)

    # The recovered ball was fumbled again
    if len(play) > 1:
      recovery_run_row['FumbleDetails'] = play[1]
      forced_fumble = re.findall(forced_fumble_pattern, play[1])
      if len(forced_fumble) > 0:
        recovery_run_row['ForcedFumbleBy'] = forced_fumble[0]

    recovery_run_row['PlayDescription'] = play[0]
    recovery_run_row['PlayOutcome'] = 'Run'
    cleaned_recovery_run_row = clean_run_plays(recovery_run_row)
    cleaned_recovery_run_row['PlayOutcome'] = original_play_copy['PlayOutcome'] # <- Maybe this isn't correct? Should 'playtype' be 'run'?
    list_recovery_runs.append(cleaned_recovery_run_row)

  ###################
  # 3.NEW DATAFRAME #
  ###################
  # - Create the cleaned replacement row(s) for the original row.

  if len(list_recovery_runs) > 0:
    df_multi_row_play = pd.concat([cleaned_original_play_copy, *list_recovery_runs], ignore_index=True)
  else:
    df_multi_row_play = cleaned_original_play_copy

  return df_multi_row_play

### PASS PLAYS

In [None]:
# PURPOSE:
# - Clean all passing type plays within a given dataframe.
# INPUT PARAMETERS:
# df_plays    - dataframe - NFL plays (can include play types other than passing)
# index_start -  integer  - index where within the dataframe the method will start
#                           cleaning in ascending order.
# RETURN:
# df_plays - dataframe - the same input df_plays but with all passing play types cleaned

def clean_pass_plays(df_plays, index_start = None):

  # Adjusting df_plays to start cleaning at a specified index (index_start)
  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    # Locating all passing type plays within dataframe
    df_pass_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Pass')]
  else:
    # Locating all passing type plays within dataframe
    df_pass_plays = df_plays[df_plays['PlayOutcome'].str.contains('Pass')]

  for idx, play in df_pass_plays['PlayDescription'].items():

    ################
    # Play details #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Pass'

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ###########
    # FUMBLES #
    ###########

    # Additional rows may be added after certain types of fumbled passing plays.
    # - The idea here is that, in those situations, the helper method 'extract_fumble_data'
    #   will return a small dataframe of the rows that the single play split into.
    #   - When this small dataframe is returned, it will replace the original play
    #     within the main dataframe of plays and then continue on cleaning the rest of the passing plays.

    if play.find('FUMBLES') != -1:
      main_action_patterns = [passer_name_pattern, qb_fumble_pattern, defensive_takeaway_run_pattern]
      main_cleaning_method = clean_pass_plays
      df_replacement_rows = extract_fumble_data(df_plays, play, idx,
                                                     main_action_patterns,
                                                     main_cleaning_method)
      df_before = df_plays.iloc[:idx]
      df_after = df_plays.iloc[idx+1:]
      df_plays = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
      index_of_last_added_row = idx + len(df_replacement_rows) - 1
      if df_pass_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_pass_plays(df_plays, index_of_last_added_row + 1)

    ###########
    # OFFENSE #
    ###########

    # NOTE:
    # - Incomplete passes will have 'PlayOutcome' as 'Pass Incomplete' as well
    #   as yardage value being 0.0

    # Yardage gained
    yardage = re.findall(yardage_gained, play)
    if len(yardage) > 0:
      df_plays.loc[idx, 'Yardage'] = int(yardage[0])
    else:
      df_plays.loc[idx, 'Yardage'] = 0

    # Formation
    Formation = re.findall(formation, play)
    # 'Aborted' is not a formation, so it is skipped
    if len(Formation) > 0 and Formation[0] != 'Aborted':
      df_plays.loc[idx, 'Formation'] = Formation[0]

    # Passer (What about spikes?)
    passer_name = re.findall(passer_name_pattern, play)
    if len(passer_name) > 0:
      df_plays.loc[idx, 'Passer'] = passer_name[0]

    receiver_name_and_passing_details = re.findall(receiver_pattern, play)
    if len(receiver_name_and_passing_details) > 0:
      df_plays.loc[idx, 'Direction'] = f"{receiver_name_and_passing_details[0][0]} {receiver_name_and_passing_details[0][1]}"
      df_plays.loc[idx, 'Receiver'] = receiver_name_and_passing_details[0][2]

    # Unique situation (offense spikes the ball)
    if play.find('spike') != -1:
      df_plays.loc[idx, 'Direction'] = 'spiked' # Direction?


    #############
    #  DEFENSE  #
    #############

    # Difference between ", " and "; " separating tacklers
    # ', ' - both defenders worked together to make the tackle
    # "; " - first defender initiated hit and second finished
    # - Should I mark the differences?

    tackler_1 = re.findall(defense_tackler_1_name_pattern, play) # tackler #1 (Could be solo or the one who initiated the hit)
    if len(tackler_1) > 0:
      df_plays.loc[idx, 'TackleBy1'] = tackler_1[0]

    tackler_2 = re.findall(defense_tackler_2_name_pattern, play) # tackler #2 (equally contributed or assisted with tackle)
    if len(tackler_2) > 0:
      df_plays.loc[idx, 'TackleBy2'] = tackler_2[0]

    pressure_by = re.findall(defense_pressure_name_pattern, play) # defender who applied pressure to the passer
    if len(pressure_by) > 0:
      df_plays.loc[idx, 'PressureBy'] = pressure_by[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:
          penalties.append(i)
      df_plays.at[idx, 'AcceptedPenalty'] = penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          penalties.append(i)
      df_plays.at[idx, 'DeclinedPenalty'] = penalties

  if df_pass_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
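
The accepted/declined penalty blocks at the end of `clean_pass_plays` reappear almost verbatim in the run and sack methods. Below is a minimal sketch of how they could be factored into one helper. It assumes, as the code above does, that accepted penalties are written as 'PENALTY' and declined ones as 'Penalty' in 'PlayDescription'. The name `split_penalties` and the sample play are hypothetical.

```python
def split_penalties(play):
    """Return (accepted, declined) lists of penalty sentences from a play description."""
    sentences = play.split(". ")
    # 'PENALTY' (upper case) marks an accepted penalty, 'Penalty' a declined one
    accepted = [s for s in sentences if 'PENALTY' in s]
    declined = [s for s in sentences if 'Penalty' in s]
    return accepted, declined

# Hypothetical play description, for illustration only
sample_play = ("(12:05) A.Back left tackle to XYZ 30 for 5 yards (B.Tackler). "
               "PENALTY on XYZ-C.Guard, Offensive Holding, 10 yards, enforced at XYZ 25 - No Play.")
accepted, declined = split_penalties(sample_play)
```

Each cleaning method could then assign a list to 'AcceptedPenalty' or 'DeclinedPenalty' only when it is non-empty, which matches the current behavior.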

### RUN PLAYS

In [None]:
# PURPOSE:
# - Clean run play types
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful run play
#                        data accessible and clean.

# NOTE:
# - Need to comment on how this is also a method being used for
#   1. fumble recoveries for yardage
#   2. fumble recoveries for touchdown
# - I also have not come across a case where a rushing play has been fumbled and someone
#   recovered the ball and scored a touchdown yet.

def clean_run_plays(df_plays, index_start = None):

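  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_run_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Run')]
  else:
    df_run_plays = df_plays[df_plays['PlayOutcome'].str.contains('Run')]

  # Exit case (no 'Run' type plays left to clean); without this the method would return None
  if df_run_plays.empty:
    return df_plays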

  # Iterating through every run play within 'df_run_plays'
  for idx, play in df_run_plays['PlayDescription'].items():

    ################
    # Play details #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Run'

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

    # Formation
    Formation = re.findall(formation, play)
    if len(Formation) > 0:
      if Formation[0] == 'Aborted':
        pass
      else:
        df_plays.loc[idx, 'Formation'] = Formation[0]

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ###########
    # FUMBLES #
    ###########

    if play.find('FUMBLES') != -1:

      # I think it would help to comment on each action added
      # Does this catch fumble recovery touchdowns?
      main_action_patterns = [rusher_pattern, aborted_fumble_pattern, qb_fumble_pattern, defensive_takeaway_run_pattern]
      main_cleaning_method = clean_run_plays
      df_replacement_rows = extract_fumble_data(df_plays, play, idx,
                                                main_action_patterns,
                                                main_cleaning_method)
      df_before = df_plays.iloc[:idx]
      df_after = df_plays.iloc[idx+1:]
      df_plays = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
      index_of_last_added_row = idx + len(df_replacement_rows) - 1
      # returning row after the last index
      if df_run_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_run_plays(df_plays, index_of_last_added_row + 1)

    #############
    #  OFFENSE  #
    #############

    # Rusher
    rusher_patterns = [rusher_pattern, defensive_takeaway_run_pattern, qb_fumble_pattern, touchdown_after_takeaway_pattern]
    # Reset for every play so a name from a previous play is never reused
    rusher_name = None
    # Loop through patterns and find the first match
    for pattern in rusher_patterns:
      rusher = re.findall(pattern, play)
      if len(rusher) > 0:
        rusher_name = rusher[0]
        df_plays.loc[idx, 'Rusher'] = rusher_name
        break

    # Direction (only when a rusher was found, since it is sliced relative to the rusher's name)
    rushing_directions = ['guard', 'middle', 'tackle', 'end', 'kneels']
    if rusher_name is not None:
      for i in rushing_directions:
        if play.find(i) != -1:
          start = play.find(rusher_name) + len(rusher_name) + 1
          end = play.find(i) + len(i)
          df_plays.loc[idx, 'Direction'] = play[start:end]
          break

    # Yardage gained
    yardage = re.findall(yardage_gained, play)
    if len(yardage) > 0:
      df_plays.loc[idx, 'Yardage'] = int(yardage[0])
    else:
      df_plays.loc[idx, 'Yardage'] = 0

    #############
    #  DEFENSE  #
    #############

    tackler_1 = re.findall(defense_tackler_1_name_pattern, play) # tackler #1 (Could be solo or the one who initiated the hit)
    if len(tackler_1) > 0:
      df_plays.loc[idx, 'TackleBy1'] = tackler_1[0]
    tackler_2 = re.findall(defense_tackler_2_name_pattern, play) # tackler #2 (equally contributed or assisted with tackle)
    if len(tackler_2) > 0:
      df_plays.loc[idx, 'TackleBy2'] = tackler_2[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:
          penalties.append(i)
      df_plays.at[idx, 'AcceptedPenalty'] = penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          penalties.append(i)
      df_plays.at[idx, 'DeclinedPenalty'] = penalties

    # Return if the last play has been cleaned in 'df_run_plays'
    if df_run_plays.tail(1).index.tolist()[0] == idx:
      return df_plays
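
The REVERSES block in `clean_run_plays` is repeated in the interception, punt, and kickoff methods. Below is a minimal sketch of the same split as a standalone helper. The name `strip_reversed_prefix` and the sample description are hypothetical, and the wording of a real reversal sentence may differ.

```python
def strip_reversed_prefix(play):
    """Split a play description at the sentence containing 'REVERSED'.

    Returns (reverse_details, remaining_play). reverse_details is the list of
    sentences up to and including the 'REVERSED' sentence, or None if the play
    was not reversed (in which case remaining_play is the play unchanged).
    """
    sentences = play.split(". ")
    for pos, sentence in enumerate(sentences):
        if 'REVERSED' in sentence:
            return sentences[:pos + 1], ". ".join(sentences[pos + 1:])
    return None, play

# Hypothetical play description, for illustration only
sample_play = ("(3:10) A.Runner up the middle to XYZ 40 for 3 yards. "
               "The play was REVERSED. "
               "(3:10) A.Runner up the middle to XYZ 38 for 1 yard (B.Tackler).")
reverse_details, remaining_play = strip_reversed_prefix(sample_play)
```

Using `enumerate` instead of `play_elements.index(i)` also avoids picking the wrong position if two sentences happen to be identical.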

### INTERCEPTIONS

In [None]:
# PURPOSE:
# - Clean intercepted plays
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful intercepted play
#                        data accessible and clean.

# ROUGH DESIGN
# 1. Narrow dataframe using 'index_start'
#    - This is a recursive method, the narrowing will get smaller and
#      smaller until all 'intercepted' type plays have been cleaned.
# 2. Grab first 'intercepted' play from narrowed dataframe
# 3. Create 2 single row dataframes.
#    a. intended play
#    b. yardage after interception
# 4. Break down play into sentences and clean
#    - Depending on the sentence within the play, will determine which
#      single row dataframe it will go to.
# 5. Combine both dataframes of cleaned data into one dataframe
# 6. Replace old play row with new cleaned multi row
# 7. return clean_intercepted_plays( x , y)
#    - x = updated df_plays
#    - y = index directly after the last clean added row

# Concerns:
# ~ 1 ~
# PLAY SNIP - "(9:53) (Shotgun) D.Watson pass short left intended for E.Moore INTERCEPTED by D.Hill (Z.Carter) at CIN 30."
# - The concern here is (Z.Carter)
#   - I do not know what to categorize this player as? I believe that he had an impact on the play and could possibly be a reason
#     that D.Hill was able to intercept the ball.
# ~ 2 ~
# PLAY SNIP - "(4:16) (Shotgun) J.Allen pass deep middle intended for S.Diggs INTERCEPTED by J.Whitehead [Q.Williams] at NYJ -1. Touchback."
# - The concern here is 'touchback'
#   - I have no idea what to do with that
# ~ 3 ~
# - I do not have anything in place to handle fumbles. What happens if a QB fumbles, recovers, then throws an interception? Or if the player who intercepted then fumbles?
# ~ 4 ~
# - There are 2 rows within this single play. (Intended throwing play, yardage after interception)
#   - For both of these rows that represent a single play, they both state that the throwing team has possession
#     - I do not know how this is going to affect future analysis of the data

def clean_intercepted_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_intercepted_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Interception')]
  else:
    df_intercepted_plays = df_plays[df_plays['PlayOutcome'].str.contains('Interception')]

  # Exit case (If no more 'Interception' type plays are found)
  if df_intercepted_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first intercepted play in 'df_intercepted_plays'
  # - Process one play per iteration in the recursive method
  idx = df_intercepted_plays.index[0]
  play = df_plays['PlayDescription'].iloc[idx]

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play.find('REVERSED') != -1:
    play_elements = play.split(". ")
    for i in play_elements:
      if i.find("REVERSED") != -1:
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  # Create 2 single row dataframes.
  # 1. intended play
  df_intended_play = df_plays.iloc[idx].copy()
  df_intended_play = pd.DataFrame([df_intended_play], columns=df_plays.columns)
  df_intended_play.reset_index(drop=True, inplace=True)
  df_intended_play['PlayDescription'] = 'nan'
  # 2. yardage after interception
  df_yardage_after_interception = df_plays.iloc[idx].copy()
  df_yardage_after_interception = pd.DataFrame([df_yardage_after_interception], columns=df_plays.columns)
  df_yardage_after_interception.reset_index(drop=True, inplace=True)
  df_yardage_after_interception['PlayDescription'] = 'nan'

  # break down play by sentences.
  play_elements = play.split(". ")

  # Every sentence within 'PlayDescription' except yardage/touchdown after interception
  intended_play_data = []

  # iterate through play_elements
  for i in play_elements:

    ##############################
    # YARDAGE AFTER INTERCEPTION #
    ##############################

    yardage_after_interception = re.findall(defensive_takeaway_run_pattern, i)
    if len(yardage_after_interception) > 0:
      df_yardage_after_interception['PlayDescription'] = i

      # Player running after interception
      df_yardage_after_interception.loc[0, 'Rusher'] = yardage_after_interception[0]

      # Playtype?
      # - Should this be a new playtype? Something like "RunAfterInterception"?

      # Yardage gained
      yardage = re.findall(yardage_gained, i)
      if len(yardage) > 0:
        df_yardage_after_interception.loc[0, 'Yardage'] = int(yardage[0])
      else:
        df_yardage_after_interception.loc[0, 'Yardage'] = 0

      # Who made tackle
      tackler = re.findall(defense_tackler_1_name_pattern, i)
      if len(tackler) > 0:
        df_yardage_after_interception.loc[0, 'TackleBy1'] = tackler[0]

      continue

    ################################
    # TOUCHDOWN AFTER INTERCEPTION #
    ################################

    touchdown_after_interception_check = re.findall(touchdown_after_takeaway_pattern, i)
    if len(touchdown_after_interception_check) > 0:
      df_yardage_after_interception['PlayDescription'] = i

      # Player running after interception
      df_yardage_after_interception.loc[0, 'Rusher'] = touchdown_after_interception_check[0]

      # Yardage gained
      yardage = re.findall(yardage_gained, i)
      if len(yardage) > 0:
        df_yardage_after_interception.loc[0, 'Yardage'] = int(yardage[0])

      # PlayOutcome
      df_yardage_after_interception.loc[0, 'PlayOutcome'] = 'Touchdown'

      # IsScoringPlay
      df_yardage_after_interception.loc[0, 'IsScoringPlay'] = 1

      continue

    intended_play_data.append(i)

  #################
  # INTENDED PLAY #
  #################

  intended_play_playdescription = ". ".join(intended_play_data)

  df_intended_play['PlayDescription'] = intended_play_playdescription

  df_intended_play.loc[0, 'PlayOutcome'] = 'Pass'
  df_intended_play = clean_pass_plays(df_intended_play)
  df_intended_play.loc[0, 'PlayOutcome'] =  df_plays['PlayOutcome'].iloc[idx]

  # Intercepted by
  intercepted_by = re.findall(interception_name_pattern, intended_play_playdescription)
  if len(intercepted_by) > 0:
    df_intended_play.loc[0, 'InterceptedBy'] = intercepted_by[0]


  #############################
  # NEW REPLACEMENT DATAFRAME #
  #############################

  # combine both single row dataframes into one
  if df_yardage_after_interception['PlayDescription'].iloc[0] == 'nan':
    df_cleaned_replacement = df_intended_play
  else:
    df_cleaned_replacement = pd.concat([df_intended_play, df_yardage_after_interception], ignore_index=True)

  # Replace old row with new cleaned dataframe
  df_before_row = df_plays.iloc[:idx]
  df_after_row = df_plays.iloc[idx+1:]
  df_plays = pd.concat([df_before_row, df_cleaned_replacement, df_after_row], ignore_index=True)


  # If this is the last play in the dataset
  if df_intercepted_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_intercepted_plays(df_plays, idx+len(df_cleaned_replacement))
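
Steps 6 and 7 of the rough design (replace the old row, then recurse from the index directly after the last added row) appear in every cleaning method. Below is a minimal sketch of that step as a single helper. `replace_row` and the toy dataframe are hypothetical, and the sketch assumes `df_plays` has a default 0..n-1 index, which is what `ignore_index=True` leaves behind.

```python
import pandas as pd

def replace_row(df_plays, idx, df_replacement_rows):
    """Replace the row at position idx with one or more rows.

    Returns the new dataframe and the index directly after the last added row.
    """
    df_before = df_plays.iloc[:idx]
    df_after = df_plays.iloc[idx + 1:]
    df_new = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
    return df_new, idx + len(df_replacement_rows)

# Toy data, for illustration only
df_plays = pd.DataFrame({'PlayOutcome': ['Run', 'Interception', 'Pass'],
                         'Yardage': [3, 0, 7]})
df_two_rows = pd.DataFrame({'PlayOutcome': ['Interception', 'Interception'],
                            'Yardage': [0, 12]})
df_plays, next_idx = replace_row(df_plays, 1, df_two_rows)
```

Here the single interception row becomes two rows, and `next_idx` points at the 'Pass' row that follows them, which is the value the recursive call would receive as `index_start`.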

### SACKS


In [None]:
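# PURPOSE:
# - Clean sacked plays
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful sacked play
#                        data accessible and clean.

def clean_sacked_plays(df_plays, index_start = None):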

  # Will cut df_plays starting from index_start (narrowing our search space)
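  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_sacked_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Sack')]
  else:
    df_sacked_plays = df_plays[df_plays['PlayOutcome'].str.contains('Sack')]

  # Exit case (no 'Sack' type plays left to clean); without this the method would return None
  if df_sacked_plays.empty:
    return df_plays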

  for idx, play in df_sacked_plays['PlayDescription'].items():

    ################
    # Play details #
    ################

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

    ###########
    # FUMBLES #
    ###########

    if play.find('FUMBLES') != -1:

      main_action_patterns = [passer_name_pattern, defensive_takeaway_run_pattern, touchdown_after_takeaway_pattern]
      main_cleaning_method = clean_sacked_plays
      df_replacement_rows = extract_fumble_data(df_plays, play, idx,
                                                main_action_patterns,
                                                main_cleaning_method)
      df_before = df_plays.iloc[:idx]
      df_after = df_plays.iloc[idx+1:]
      df_plays = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
      index_of_last_added_row = idx + len(df_replacement_rows) - 1
      # returning row after the last index
      if df_sacked_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_sacked_plays(df_plays, index_of_last_added_row + 1)

    #############
    #  OFFENSE  #
    #############

    # Formation
    Formation = re.findall(formation, play)
    if len(Formation) > 0:
      if Formation[0] == 'Aborted':
        pass
      else:
        df_plays.loc[idx, 'Formation'] = Formation[0]

    # Sacked Passer
    sacked_passer_name = re.findall(passer_name_pattern, play)
    if len(sacked_passer_name) > 0:
      df_plays.loc[idx, 'Passer'] = sacked_passer_name[0]

    # Yardage lost
    yardage = re.findall(yardage_from_sack, play)
    if len(yardage) > 0:
      df_plays.loc[idx, 'Yardage'] = int(yardage[0])

    #############
    #  DEFENSE  #
    #############

    # Solo sack (One person sacked the passer)
    solo_sack = re.findall(defense_tackler_1_name_pattern, play)
    if len(solo_sack) > 0:
      df_plays.loc[idx, 'SackedBy'] = solo_sack[0]

    # Split sack (A sack was given to the passer by multiple defenders)
    split_sack = re.findall(split_sack_pattern, play)
    if len(split_sack) > 0:
      df_plays.at[idx, 'SackedBy'] = list(split_sack[0])

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:
          penalties.append(i)
      df_plays.at[idx, 'AcceptedPenalty'] = penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          penalties.append(i)
      df_plays.at[idx, 'DeclinedPenalty'] = penalties

    if df_sacked_plays.tail(1).index.tolist()[0] == idx:
      return df_plays
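
In `clean_sacked_plays`, 'SackedBy' ends up holding a string for a solo sack and a list for a split sack, so anything reading that column has to handle both. One possible way to normalize it is sketched below. `normalize_sackers` is a hypothetical helper, and the input shapes are assumptions based on how the `re.findall` results are used above (a list of strings for the solo pattern, a list of tuples for the split pattern).

```python
def normalize_sackers(solo_matches, split_matches):
    """Always return a list of sacker names (empty if no sack credit was found).

    solo_matches  - list of str    - assumed shape of re.findall with one capture group
    split_matches - list of tuples - assumed shape of re.findall with several capture groups
    """
    if split_matches:
        # A split sack takes priority, mirroring the order of the assignments above
        return [name for name in split_matches[0] if name]
    if solo_matches:
        return [solo_matches[0]]
    return []

# Hypothetical player names, for illustration only
solo = normalize_sackers(['A.Rusher'], [])
split = normalize_sackers(['A.Rusher'], [('A.Rusher', 'B.Rusher')])
```

Whether a uniform list is preferable to the current mixed column depends on how 'SackedBy' is analyzed later, so treat this as an option rather than a fix.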

### PUNTS

In [93]:
# A punt playtype will be split into 1 or 2 rows (the second row is only added when there is a return)
#   1. The Punt
#      - 'PlayType'
#         - Punt
#      - 'Punter'
#      - 'LongSnapper'
#   2. The Punt Return
#      - 'PlayType'
#         - Punt Return
#      - 'PlayOutcome'
#         - x yard punt return
#         - fair catch
#         - touchback
#         - out of bounds
#         - downed
#      - 'Returner'
#      - 'Receiver'
#      - 'Yardage'
#      - 'TackleBy1'
#      - 'TackleBy2'
#      - 'DownedBy'

# I need to figure out a fake punt
# I need to figure out a punt that has been blocked
# I need to figure out what to do when a fumble happens
# I need to figure out what to do when a touchdown happens
# Maybe in the future, to make this more space friendly, I can combine features
# - Such as 'Punter' & 'LongSnapper' OR 'TackleBy1' & 'DownedBy'
#   OR 'Returner' & 'Receiver'

def clean_punt_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_punt_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Punt')]
  else:
    df_punt_plays = df_plays[df_plays['PlayOutcome'].str.contains('Punt')]

  if df_punt_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first punt play in 'df_punt_plays'
  # - Process one play per iteration in the recursive method
  idx = df_punt_plays.index[0]
  play = df_plays['PlayDescription'].iloc[idx]

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play.find('REVERSED') != -1:
    play_elements = play.split(". ")
    for i in play_elements:
      if i.find("REVERSED") != -1:
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  # Create 2 single row dataframes.
  # 1. The Punt
  df_punt = df_plays.iloc[idx].copy()
  df_punt = pd.DataFrame([df_punt], columns=df_plays.columns)
  df_punt.reset_index(drop=True, inplace=True)
  df_punt['PlayDescription'] = 'nan'
  # 2. The Punt Return
  df_punt_return = df_plays.iloc[idx].copy()
  df_punt_return = pd.DataFrame([df_punt_return], columns=df_plays.columns)
  df_punt_return.reset_index(drop=True, inplace=True)
  df_punt_return['PlayDescription'] = 'nan'

  #############
  # PLAY TIME #
  #############

  time = re.findall(time_on_clock_pattern, play)
  if len(time) > 0:
    df_punt.loc[0, 'TimeOnTheClock'] = time[0]

  # break down play by sentences.
  play_elements = play.split(". ")

  accepted_penalties = []
  declined_penalties = []

  for i in play_elements:

    ########
    # PUNT #
    ########

    # All data needed for first row in replacement dataframe
    punt = re.findall(punting_pattern, i)
    if len(punt) > 0:
      df_punt.loc[0, 'PlayType'] = 'Punt'
      df_punt.loc[0, 'PlayDescription'] = i
      df_punt.loc[0, 'Kicker'] = punt[0][0]
      df_punt.loc[0, 'Yardage'] = int(punt[0][1])
      df_punt.loc[0, 'LongSnapper'] = punt[0][2]
      # Touchback
      if i.find('Touchback') != -1:
        df_punt.loc[0, 'PlayOutcome'] = 'Touchback'
        continue
      # Out of bounds
      if i.find('out of bounds') != -1:
        df_punt.loc[0, 'PlayOutcome'] = 'out of bounds'
        continue
      # Downed by
      if i.find('downed by') != -1:
        df_punt.loc[0, 'PlayOutcome'] = 'downed'
        downed_by = re.findall(kick_downed_by_pattern, i)
        if len(downed_by) > 0:
          # Strip the team abbreviation from the player name (e.g. IND-G.Stuard -> G.Stuard)
          df_punt.loc[0, 'DownedBy'] = downed_by[0][downed_by[0].find("-")+1:]
        continue
      # fair catch
      if i.find('fair catch') != -1:
        df_punt.loc[0, 'PlayOutcome'] = 'fair catch'
        fair_catch_by = re.findall(punt_fair_catch_pattern, i)
        if len(fair_catch_by) > 0:
          df_punt.loc[0, 'Returner'] = fair_catch_by[0]
        continue
      continue

    ###############
    # PUNT RETURN #
    ###############

    # All data needed for the second row within replacement dataframe
    # - Second row only needed when there is a punt return for yardage
    punt_return = re.findall(punt_return_pattern, i)
    if len(punt_return) > 0:
      df_punt_return.loc[0, 'PlayDescription'] = i
      df_punt_return.loc[0, 'PlayOutcome'] = 'Punt Return' # <- Maybe change this to something like "x yard punt return"
      df_punt_return.loc[0, 'PlayType'] = 'Punt Return'
      df_punt_return.loc[0, 'Returner'] = punt_return[0][0]
      df_punt_return.loc[0, 'Yardage'] = int(punt_return[0][1])
      df_punt_return.loc[0, 'TackleBy1'] = punt_return[0][2]
      if punt_return[0][3] != '':
        df_punt_return.loc[0, 'TackleBy2'] = punt_return[0][3]

    ###############################
    # TOUCHDOWN AFTER PUNT RETURN #
    ###############################

    touchdown_after_punt_check = re.findall(touchdown_after_takeaway_pattern, i)
    if len(touchdown_after_punt_check) > 0:
      df_punt_return.loc[0, 'PlayDescription'] = i

      # Player returning the punt for a touchdown
      df_punt_return.loc[0, 'Rusher'] = touchdown_after_punt_check[0]

      # Yardage gained
      yardage = re.findall(yardage_gained, i)
      if len(yardage) > 0:
        df_punt_return.loc[0, 'Yardage'] = int(yardage[0])

      # PlayOutcome
      df_punt_return.loc[0, 'PlayOutcome'] = 'Touchdown'

      # IsScoringPlay
      df_punt_return.loc[0, 'IsScoringPlay'] = 1

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if i.find('PENALTY') != -1:
      accepted_penalties.append(i)

    # Declined Penalty
    if i.find('Penalty') != -1:
      declined_penalties.append(i)

    # If playoutcome is the same as the original play, then run second sentence through
    # run cleaning method.

  if len(accepted_penalties) > 0:
    df_punt.at[0, 'AcceptedPenalty'] = accepted_penalties
  if len(declined_penalties) > 0:
    df_punt.at[0, 'DeclinedPenalty'] = declined_penalties

  #############################
  # NEW REPLACEMENT DATAFRAME #
  #############################

  if df_punt_return['PlayDescription'].iloc[0] == 'nan':
    df_replacement_rows = df_punt
  else:
    df_replacement_rows = pd.concat([df_punt, df_punt_return], ignore_index=True)

  df_before_row = df_plays.iloc[:idx]
  df_after_row = df_plays.iloc[idx+1:]
  df_plays = pd.concat([df_before_row, df_replacement_rows, df_after_row], ignore_index=True)

  if df_punt_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_punt_plays(df_plays, idx+len(df_replacement_rows))
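
The 'DownedBy' value arrives with a team abbreviation attached (e.g. IND-G.Stuard), and the same slicing is used again for onside kicks. Below is a small sketch of that step as a named helper. `strip_team_prefix` is hypothetical, and the hyphenated surname in the example is made up to show that only the first "-" is treated as the separator.

```python
def strip_team_prefix(tagged_name):
    """Drop a leading team abbreviation such as 'IND-' from a player name.

    Names without a '-' are returned unchanged.
    """
    team, separator, player = tagged_name.partition("-")
    return player if separator else tagged_name

downed_by = strip_team_prefix("IND-G.Stuard")
# Made-up hyphenated surname: only the first '-' splits team from player
hyphenated = strip_team_prefix("XYZ-A.Smith-Jones")
```

One caveat: a name that has no team prefix but does contain a hyphen would lose everything before the hyphen, both here and in the original slicing, so this relies on the pattern always capturing the team abbreviation.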

### KICKOFFS

In [103]:
# A kickoff playtype will be split into 1 or more rows

#
# I need to figure out an onside kick (recovered by kicking team)
# I need to figure out fumbled returns
# I need to figure out returns for a touchdown

def clean_kickoff_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_kickoff_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('kickoff', case=False)]
  else:
    df_kickoff_plays = df_plays[df_plays['PlayOutcome'].str.contains('kickoff', case=False)]

  if df_kickoff_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first kickoff play in 'df_kickoff_plays'
  # - Process one play per iteration in the recursive method
  idx = df_kickoff_plays.index[0]
  play = df_plays['PlayDescription'].iloc[idx]
  # print(idx)

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play.find('REVERSED') != -1:
    play_elements = play.split(". ")
    for i in play_elements:
      if i.find("REVERSED") != -1:
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  # Create 2 single row dataframes.
  # 1. The Kickoff
  df_kickoff = df_plays.iloc[idx].copy()
  df_kickoff = pd.DataFrame([df_kickoff], columns=df_plays.columns)
  df_kickoff.reset_index(drop=True, inplace=True)
  df_kickoff['PlayDescription'] = 'nan'
  # 2. The Kickoff Return
  df_kickoff_return = df_plays.iloc[idx].copy()
  df_kickoff_return = pd.DataFrame([df_kickoff_return], columns=df_plays.columns)
  df_kickoff_return.reset_index(drop=True, inplace=True)
  df_kickoff_return['PlayDescription'] = 'nan'

  # break down play by sentences.
  play_elements = play.split(". ")

  accepted_penalties = []
  declined_penalties = []

  for i in play_elements:

    ###########
    # KICKOFF #
    ###########

    kickoff = re.findall(kickoff_pattern, i)
    if len(kickoff) > 0:
      df_kickoff.loc[0, 'PlayDescription'] = i
      df_kickoff.loc[0, 'PlayType'] = 'Kickoff'
      # print(kickoff)
      df_kickoff.loc[0, 'Kicker'] = kickoff[0][0]
      df_kickoff.loc[0, 'Yardage'] = int(kickoff[0][1])
      if i.find('Touchback') != -1:
        df_kickoff.loc[0, 'PlayOutcome'] = 'Touchback'
        continue
      # I need to figure out what the difference will be when the kicking team recovers
      if i.find('onside') != -1:
        df_kickoff.loc[0, 'PlayOutcome'] = 'onside'
        downed_by = re.findall(kick_downed_by_pattern, i)
        if len(downed_by) > 0:
          df_kickoff.loc[0, 'DownedBy'] = downed_by[0][downed_by[0].find("-")+1:]
        continue
      continue

    ##################
    # KICKOFF RETURN #
    ##################

    #(228, [242])
    # C.Santos kicks onside 9 yards from CHI 35 to CHI 44
    # J.Reed (didn't try to advance) to CHI 44 for no gain.
    kick_return = re.findall(kick_return_pattern, i)
    if len(kick_return) > 0:
      df_kickoff_return.loc[0, 'PlayDescription'] = i
      df_kickoff_return.loc[0, 'PlayType'] = 'Kickoff Return'
      df_kickoff_return.loc[0, 'Returner'] = kick_return[0][0]
      df_kickoff_return.loc[0, 'Yardage'] = int(kick_return[0][1])
      df_kickoff_return.loc[0, 'TackleBy1'] = kick_return[0][2]
      if kick_return[0][3] != '':
        df_kickoff_return.loc[0, 'TackleBy2'] = kick_return[0][3]

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if i.find('PENALTY') != -1:
      accepted_penalties.append(i)

    # Declined Penalty
    if i.find('Penalty') != -1:
      declined_penalties.append(i)

    # If playoutcome is the same as the original play, then run second sentence through
    # run cleaning method.

  if len(accepted_penalties) > 0:
    df_kickoff.at[0, 'AcceptedPenalty'] = accepted_penalties
  if len(declined_penalties) > 0:
    df_kickoff.at[0, 'DeclinedPenalty'] = declined_penalties

  #############################
  # NEW REPLACEMENT DATAFRAME #
  #############################

  if df_kickoff_return['PlayDescription'].iloc[0] == 'nan':
    df_replacement_rows = df_kickoff
  else:
    df_replacement_rows = pd.concat([df_kickoff, df_kickoff_return], ignore_index=True)

  df_before_row = df_plays.iloc[:idx]
  df_after_row = df_plays.iloc[idx+1:]
  df_plays = pd.concat([df_before_row, df_replacement_rows, df_after_row], ignore_index=True)

  if df_kickoff_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_kickoff_plays(df_plays, idx+len(df_replacement_rows))
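
For the onside kick note in `clean_kickoff_plays`, here is a rough sketch of pulling the kicker, the distance, and an onside flag out of the kickoff sentence. The regular expression is a hypothetical stand-in, not the notebook's `kickoff_pattern` (which is defined elsewhere), and it has only been checked against the sample sentence quoted in the comments above.

```python
import re

# Hypothetical stand-in pattern; the notebook's real 'kickoff_pattern' is defined elsewhere
kickoff_sketch_pattern = re.compile(
    r"(?P<kicker>[A-Z][A-Za-z'\-]*\.[A-Za-z'\-]+) kicks (?P<onside>onside )?"
    r"(?P<yards>-?\d+) yards from (?P<from_team>[A-Z]{2,3}) (?P<from_yard>\d+)"
)

def parse_kickoff(sentence):
    """Return a dict of kickoff details, or None if the sentence is not a kickoff."""
    match = kickoff_sketch_pattern.search(sentence)
    if match is None:
        return None
    return {
        'Kicker': match.group('kicker'),
        'Yardage': int(match.group('yards')),
        'IsOnside': match.group('onside') is not None,
        'KickedFrom': f"{match.group('from_team')} {match.group('from_yard')}",
    }

# Sample sentence taken from the comments in clean_kickoff_plays
onside_kick = parse_kickoff("C.Santos kicks onside 9 yards from CHI 35 to CHI 44")
```

An explicit onside flag would leave 'PlayOutcome' free to record who recovered the ball, which is the open question in the notes. It does not by itself tell a kicking-team recovery from a receiving-team one.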

### TOUCHDOWN PLAYS

In [None]:
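# PURPOSE:
# - Clean touchdown plays by relabeling each one with its underlying play
#   type (punt, sack, ...) and passing it to that play type's cleaning method.
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful touchdown play
#                        data accessible and clean.

def clean_touchdown_plays(df_plays, index_start=None):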

  # Cut 'df_plays' to begin from 'index_start' to the last touchdown play available in dataframe
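  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_touchdown_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Touchdown')]
  else:
    df_touchdown_plays = df_plays[df_plays['PlayOutcome'].str.contains('Touchdown')]

  # Exit case (no 'Touchdown' type plays left to clean); without this the method would return None
  if df_touchdown_plays.empty:
    return df_plays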

  # Iterating through every touchdown play within 'df_touchdown_plays'
  for idx, play in df_touchdown_plays['PlayDescription'].items():

    # - Once I figure out what kind of touchdown it was, I will be able to
    #   determine the 'PlayType'

    ##########################
    # PUNT RETURN TOUCHDOWNS #
    ##########################

    punt_play = re.findall(punting_pattern, play)
    if len(punt_play) > 0:

      # creating a copy of the punt touchdown play and cleaning the copy
      punt_touchdown_row = df_plays.iloc[idx].copy()
      punt_touchdown_row['PlayOutcome'] = 'Punt'
      punt_touchdown_row['IsScoringPlay'] = 0 # This will only be the value for the team that punted the ball
      punt_touchdown_row = pd.DataFrame([punt_touchdown_row], columns=df_plays.columns)
      punt_touchdown_row.reset_index(drop=True, inplace=True)
      cleaned_punt_touchdown_row = clean_punt_plays(punt_touchdown_row)

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:idx]
      df_after_row = df_plays.iloc[idx+1:]
      df_plays = pd.concat([df_before_row, cleaned_punt_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(cleaned_punt_touchdown_row))

    #####################################
    # SACKED FUMBLE RECOVERY TOUCHDOWNS #
    #####################################

    # sacked_play = re.findall(sacked_passer_name_pattern, play)
    # if len(sacked_play) > 0:
    if play.find("sacked") != -1:

      # creating a copy of the sack touchdown play and cleaning the copy
      sacked_touchdown_row = df_plays.iloc[idx].copy()
      sacked_touchdown_row['PlayOutcome'] = 'Sack'
      sacked_touchdown_row['IsScoringPlay'] = 0
      sacked_touchdown_row = pd.DataFrame([sacked_touchdown_row], columns=df_plays.columns)
      sacked_touchdown_row.reset_index(drop=True, inplace=True)
      cleaned_sacked_touchdown_row = clean_sacked_plays(sacked_touchdown_row)

      # Mark last added row as touchdown
      cleaned_sacked_touchdown_row.loc[cleaned_sacked_touchdown_row.index[-1], 'PlayOutcome'] = 'Touchdown'
      cleaned_sacked_touchdown_row.loc[cleaned_sacked_touchdown_row.index[-1], 'IsScoringPlay'] = 1

      # Replacing old row with cleaned row (Original row can sometimes be replaced with multiple rows)
      df_before_row = df_plays.iloc[:idx]
      df_after_row = df_plays.iloc[idx+1:]
      df_plays = pd.concat([df_before_row, cleaned_sacked_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(cleaned_sacked_touchdown_row))

    ##########################
    # INTERCEPTED TOUCHDOWNS #
    ##########################

    # Still need to clean intercepted play types
    if play.find("INTERCEPTED") != -1:

      # creating a copy of the intercepted touchdown play and cleaning the copy
      intercepted_touchdown_row = df_plays.iloc[idx].copy()
      intercepted_touchdown_row['PlayOutcome'] = 'Interception'
      intercepted_touchdown_row['IsScoringPlay'] = 0 # This will only be the value for the team that threw the interception
      intercepted_touchdown_row = pd.DataFrame([intercepted_touchdown_row], columns=df_plays.columns)
      intercepted_touchdown_row.reset_index(drop=True, inplace=True)
      cleaned_intercepted_touchdown_row = clean_intercepted_plays(intercepted_touchdown_row)

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:idx]
      df_after_row = df_plays.iloc[idx+1:]
      df_plays = pd.concat([df_before_row, cleaned_intercepted_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(cleaned_intercepted_touchdown_row))

    ######################
    # PASSING TOUCHDOWNS #
    ######################

    # If a play has a passer throwing the ball, I am assuming it is a passing play
    passing_play = re.findall(passer_name_pattern, play)
    if len(passing_play) > 0 and play.find("sacked") == -1:

      # creating a copy of the passing touchdown play row and cleaning the copy
      passing_touchdown_row = df_plays.iloc[idx].copy()
      passing_touchdown_row['PlayType'] = 'Pass'
      passing_touchdown_row['PlayOutcome'] = 'Pass'
      passing_touchdown_row['IsScoringPlay'] = 1
      passing_touchdown_row = pd.DataFrame([passing_touchdown_row], columns=df_plays.columns)
      cleaned_passing_touchdown_row = clean_pass_plays(passing_touchdown_row)
      cleaned_passing_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].iloc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:idx]
      df_after_row = df_plays.iloc[idx+1:]
      df_plays = pd.concat([df_before_row, cleaned_passing_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+1)

    ######################
    # RUSHING TOUCHDOWNS #
    ######################

    # Rusher
    rusher_patterns = [rusher_pattern, defensive_takeaway_run_pattern]
    # Loop through patterns and find the first match
    for pattern in rusher_patterns:
      rusher = re.findall(pattern, play)
      if len(rusher) > 0:
        # creating a copy of the rushing touchdown play row and cleaning the copy
        rushing_touchdown_row = df_plays.iloc[idx].copy()
        rushing_touchdown_row['PlayType'] = 'Run'
        rushing_touchdown_row['PlayOutcome'] = 'Run'
        rushing_touchdown_row['IsScoringPlay'] = 1
        rushing_touchdown_row = pd.DataFrame([rushing_touchdown_row], columns=df_plays.columns)
        cleaned_rushing_touchdown_row = clean_run_plays(rushing_touchdown_row)
        cleaned_rushing_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].iloc[idx]

        # Replacing old row with cleaned row
        df_before_row = df_plays.iloc[:idx]
        df_after_row = df_plays.iloc[idx+1:]
        df_plays = pd.concat([df_before_row, cleaned_rushing_touchdown_row, df_after_row], ignore_index=True)

        # Recursion to update 'df_plays'
        if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
          return df_plays
        else:
          return clean_touchdown_plays(df_plays, idx+1)
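
Each branch above recurses after a single replacement because inserting rows shifts every later index. One alternative, sketched here under the assumption that each play can be cleaned independently of the rows after it, is to walk the target positions from last to first, so that rows inserted at a later position never move an earlier one. `expand_play` is a hypothetical stand-in for the per-play cleaning methods, and plain lists with toy strings stand in for the dataframe:

```python
def expand_play(play):
  # Hypothetical stand-in for a cleaning method: a punt return touchdown is
  # split into two rows, everything else stays as a single row.
  if 'TOUCHDOWN' in play and 'punts' in play:
    return ['Punt', 'Punt Return Touchdown']
  return [play]

def clean_in_reverse(plays, target_positions):
  cleaned = list(plays)
  # Highest position first: splicing there leaves lower positions untouched
  for pos in sorted(target_positions, reverse=True):
    cleaned[pos:pos+1] = expand_play(cleaned[pos])
  return cleaned

toy_plays = ['run', 'punts, TOUCHDOWN', 'pass', 'punts, TOUCHDOWN']
cleaned_toy_plays = clean_in_reverse(toy_plays, [1, 3])
print(cleaned_toy_plays)
```

Being a loop, this also avoids adding one stack frame per cleaned play, which may matter because CPython's default recursion limit is 1000.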

## 3. PIPELINE MAIN METHOD

In [74]:
# PURPOSE:
# - Accept a dataframe of plays (dataframes formatted by NFL_Scrapers) and
#   return a cleaned dataframe of those plays.
# INPUT PARAMETERS:
# df_all_plays         - dataframe - all plays in raw form from NFL_Scraper that user
#                                    would like to clean.
# OUTPUT:
# df_all_plays_cleaned - dataframe - all plays from 'df_all_plays' cleaned and data
#                                    dispersed into individual new features.

# CURRENT DESIGN PLAN:
# 1. Use uniquely designed methods for each play type to clean within dataframe
#    - (e.g. pass, run, touchdown, punt, sack, ... )
# 2. Repeat until all plays within dataframe have been cleaned.
#   NOTE:
#   - It is important to fully clean a play type before moving to the next
#      because cleaning can add new rows to the dataframe,
#      which resets the dataframe's indexing.
#      - If we were to separate all play types from the beginning, the indexes
#        could shift around causing, for example, an index that might originally
#        point to a run play to now instead point at a pass play.

# NOTES:
# - I think "PlayOutcomes" is what determines the yardage gained on an intended play?
#   - This does not seem right to me.
#   - EXAMPLE:
#     - (9:54) Bre.Hall left end to BUF 22 for -1 yards (G.Rousseau)
#       FUMBLES (G.Rousseau), ball out of bounds at BUF 25.
#       - I would think that Bre.Hall would get docked -1 yards for his run.
#         - But I believe that he is actually docked -4
#           - 'PlayStart' = 2nd & 9 at BUF 21
#           - The play ends at BUF 25
#             - I am going to track yardage based on possession of the ball,
#               so I will record this play as -1 yards, not -4.

def clean_dataframe_of_plays(df_all_plays):

  ###########################
  # NEW COLUMN DESCRIPTIONS #
  ###########################

  # PlayType           - The type of play (e.g. pass/run)
  # TimeOnTheClock     - The time that was on the clock when the play started
  # Formation          - Play formation
  # Passer             - Player that threw the ball (mostly the quarterback)
  # Rusher             - Player that ran the ball (mostly the runningback)
  # Receiver           - Player on the same team as the passer that caught the ball
  # PassType           - Whether the pass was a deep or short pass?
  # Direction          - Where the ball is going during the play
  # Yardage            - Yards gained during the play
  # TackleBy1          - Main tackler on the play (could be solo or could be with someone else)
  # TackleBy2          - Assisting tackler (second player credited on the tackle)
  # PressureBy         - Defender that applied pressure to the passer
  # InterceptedBy      - Defender that intercepted the passing play
  # FumbleDetails      - A list that has what happened after the fumble
  #                      - [forced fumble by, recovered by, yards gained, tackled by]
  # ReverseDetails     - A list having plays leading up to play reversal
  # InjuredPlayers     - Players that were injured during the play
  # PenaltyDescription - If there is a penalty, gives a description of it
  #                      - [who caused the penalty, what was the penalty, yards lost if penalty accepted]

  new_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "Direction", "Yardage",
                "TackleBy1", "TackleBy2", "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
                "FumbleDetails", "ReverseDetails",
                "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
                "Kicker", "LongSnapper", "Returner", "DownedBy"]

  string_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "Direction",
                    "TackleBy1", "TackleBy2", "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
                    "FumbleDetails", "ReverseDetails",
                    "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
                    "Kicker", "LongSnapper", "Returner", "DownedBy"]

  int_columns = ["Yardage"]

  ########################################
  # RETURN DATAFRAME WITH ADDED FEATURES #
  ########################################

  df_all_plays_cleaned = df_all_plays.copy()
  df_all_plays_cleaned = df_all_plays_cleaned.reindex(columns=df_all_plays_cleaned.columns.tolist() + new_columns)
  df_all_plays_cleaned[string_columns] = df_all_plays_cleaned[string_columns].astype(str)
  df_all_plays_cleaned[int_columns] = df_all_plays_cleaned[int_columns].astype(float)

  ########################################
  # GETTING PLAY CATEGORIES AND CLEANING #
  ########################################

  df_all_plays_cleaned = clean_run_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_pass_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_intercepted_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_sacked_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_punt_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_kickoff_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_touchdown_plays(df_all_plays_cleaned)


  return df_all_plays_cleaned
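
The NOTE in the design plan (indexes shifting once a cleaning step adds rows) can be reproduced on a toy dataframe. This is only an illustrative sketch; `df_demo` and its three rows are made up for the example:

```python
import pandas as pd

df_demo = pd.DataFrame({'PlayOutcome': ['Pass', 'Punt', 'Run']})

# Position of the run play, captured before any cleaning has happened
stale_run_position = df_demo.index[df_demo['PlayOutcome'] == 'Run'].tolist()[0]

# Cleaning the punt at position 1 splits it into two rows (punt + return)
df_punt_rows = pd.DataFrame({'PlayOutcome': ['Punt', 'Punt Return']})
df_demo = pd.concat([df_demo.iloc[:1], df_punt_rows, df_demo.iloc[2:]], ignore_index=True)

# The stale position now points at the inserted return row, not the run
print(df_demo['PlayOutcome'].iloc[stale_run_position])
```

This is why each cleaning method above looks up its own play type only after the previous method has finished.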

# TESTING (Helper Methods)

In [None]:
# PURPOSE:
# - A tool that can be used to compare original plays and their cleaned versions

# I would like to return a map that has:
# KEY: index of original unclean play
# VALUE: index(es) of cleaned play

def unclean_clean_matches(df_unclean_plays, df_clean_plays):

  my_map = {}

  # This group of features is unique to each play
  # - Both the unclean and cleaned versions of the plays have these
  # - These features will be used to find the matching plays between the unclean df and the cleaned df
  matching_features = ['Season', 'Week', 'Date', 'AwayTeam', 'HomeTeam', 'Quarter', 'DriveNumber', 'TeamWithPossession', 'PlayNumberInDrive']

  # Iterate through each row of the dataframe of unclean plays
  for u_row in df_unclean_plays.itertuples(index=True):
    u_features = [getattr(u_row, col) for col in matching_features]

    matching_indexes = []
    matches_found = False

    # Iterate through each row of the dataframe of cleaned plays
    # - The starting index will be the index of the unclean play within the main original dataframe of plays
    #   - The matching cleaned pair will either be at the exact same location or higher
    for c_row in df_clean_plays[u_row.Index::].itertuples(index=True):
      c_features = [getattr(c_row, col) for col in matching_features]

      # If a match is found, check for consecutive matching rows because some uncleaned plays were cleaned into multiple rows
      # - Once a non-matching row follows a matching one, break the loop because all rows for this play have been found.
      if u_features == c_features:
        matching_indexes.append(c_row.Index)
        matches_found = True
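      elif matches_found:
        break

    # Record the match after the inner loop so that a play whose cleaned rows
    # run to the end of 'df_clean_plays' is not silently dropped
    if matches_found:
      my_map[u_row.Index] = matching_indexes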

  return my_map
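
The nested loops above compare every unclean play against a slice of the cleaned dataframe. A possible alternative, sketched here with a hypothetical function name, indexes the cleaned rows by their identifying features once and then looks each unclean play up directly. It assumes, as the comment above states, that the matching features are unique to each play; unlike the original it does not require the matching rows to be consecutive:

```python
from collections import defaultdict
import pandas as pd

def unclean_clean_matches_by_key(df_unclean_plays, df_clean_plays, matching_features):
  # Hypothetical variant: cleaned row indexes grouped by identifying features
  clean_lookup = defaultdict(list)
  for c_row in df_clean_plays[matching_features].itertuples(index=True):
    clean_lookup[tuple(c_row[1:])].append(c_row.Index)

  my_map = {}
  for u_row in df_unclean_plays[matching_features].itertuples(index=True):
    key = tuple(u_row[1:])
    if key in clean_lookup:
      my_map[u_row.Index] = clean_lookup[key]
  return my_map

# Toy data: the second play was cleaned into two rows
df_toy_unclean = pd.DataFrame({'DriveNumber': [1, 1], 'PlayNumberInDrive': [1, 2]})
df_toy_clean = pd.DataFrame({'DriveNumber': [1, 1, 1], 'PlayNumberInDrive': [1, 2, 2]})
toy_map = unclean_clean_matches_by_key(df_toy_unclean, df_toy_clean,
                                       ['DriveNumber', 'PlayNumberInDrive'])
print(toy_map)
```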

# TESTING AREA

In [104]:
week1_2023_plays_copy = week1_2023_plays.copy()

df_week1_plays_cleaned = clean_dataframe_of_plays(week1_2023_plays_copy)

In [71]:
df_week1_plays_cleaned.shape

(2720, 37)

## Rushing plays

In [None]:
# Number of running type plays during 2023, Week 1

df_unclean_run_plays = week1_2023_plays.loc[week1_2023_plays['PlayOutcome'].str.contains('Run')]

map_run_plays = unclean_clean_matches(df_unclean_run_plays, df_week1_plays_cleaned)

len(map_run_plays.keys())

831

In [None]:
# Every unclean running play and their associated cleaned play breakdown

for i in map_run_plays.keys():
  print(f"({i}, {map_run_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(3, [3])
(14:01) J.Cook up the middle to BUF 40 for 3 yards (J.Johnson, J.Franklin-Myers).

(4, [4])
(13:24) (Shotgun) J.Cook up the middle to BUF 42 for 2 yards (Q.Williams; J.Franklin-Myers).

(8, [9])
(9:24) (Shotgun) J.Cook right tackle to BUF 23 for 5 yards (Qu.Williams).

(10, [11])
(8:02) (Shotgun) J.Allen scrambles up the middle to BUF 38 for 14 yards (D.Reed).

(12, [13])
(6:52) (No Huddle, Shotgun) J.Cook up the middle to BUF 49 for 8 yards (D.Reed).

(23, [24])
(10:35) (Shotgun) J.Cook left end to BUF 20 for -5 yards (Q.Williams).

(27, [28])
(8:40) (Shotgun) J.Allen scrambles left end pushed ob at NYJ 48 for 6 yards (Q.Jefferson; C.Mosley).

(28, [29])
(7:59) J.Cook right end ran ob at NYJ 36 for 12 yards (J.Sherwood).

(29, [30])
(7:32) (No Huddle, Shotgun) J.Cook up the middle to NYJ 37 for -1 yards (Qu.Williams, A.Gardner).

(32, [33])
(5:34) (Shotgun) D.Harris up the middle to NYJ 5 for 3 yards (Qu.Williams).

(36, [37])
(2:36) (Shotgun) J.Cook left tackle pushed ob at 

In [None]:
# fumbled rushing plays (not including touchdowns)

df_unclean_rush_fumble_plays = week1_2023_plays.loc[(week1_2023_plays['PlayOutcome'].str.contains('Run')) &
                                                    (week1_2023_plays['PlayDescription'].str.contains('fumbles', case=False))]

for i in unclean_clean_matches(df_unclean_rush_fumble_plays, df_week1_plays_cleaned).items():
  print(i)

(115, [121])
(230, [245])
(756, [816])
(826, [890])
(933, [1002])
(1015, [1089])
(1214, [1303])
(1343, [1436])
(1512, [1616])
(1921, [2047])


In [None]:
dict_unclean_to_clean_rush_fumble_plays = unclean_clean_matches(df_unclean_rush_fumble_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_rush_fumble_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_rush_fumble_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(115, [121])
(9:54) Bre.Hall left end to BUF 22 for -1 yards (G.Rousseau)
FUMBLES (G.Rousseau), ball out of bounds at BUF 25.

(230, [245])
(2:08) S.Clifford FUMBLES (Aborted) at CHI 35, and recovers at CHI 35.

(756, [816])
(6:44) (Shotgun) J.Goff Aborted
F.Ragnow FUMBLES at KC 24, recovered by DET-J.Goff at KC 27
J.Goff to KC 27 for no gain (G.Karlaftis).

(826, [890])
(8:53) (Shotgun) D.Jones Aborted
J.Schmitz FUMBLES at DAL 18, recovered by NYG-D.Jones at DAL 27.

(933, [1002])
(9:27) (Shotgun) D.Jones FUMBLES (Aborted) at NYG 30, and recovers at NYG 30
D.Jones to NYG 32 for 2 yards (M.Smith).

(1015, [1089])
(6:33) (No Huddle, Shotgun) L.Jackson scrambles right end to HOU 20 for 6 yards (T.Thomas)
FUMBLES (T.Thomas), recovered by BAL-K.Zeitler at HOU 23
HOU-H.Ridgeway was injured during the play.

(1214, [1303])
(1:39) J.Williams right tackle to TEN 9 for 11 yards (K.Byard, S.Murphy-Bunting)
FUMBLES (S.Murphy-Bunting), and recovers at TEN 9.

(1343, [1436])
(3:02) T.Munford report
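
The print loop in the cells above is repeated for each play category in the sections that follow. A small sketch of how it might be wrapped once, assuming a hypothetical helper `format_play_breakdown` that returns the text instead of printing it:

```python
import pandas as pd

def format_play_breakdown(play_map, play_descriptions):
  # Hypothetical helper: one text block per unclean play, with its cleaned
  # row indexes followed by the description split into sentences.
  blocks = []
  for unclean_idx, clean_idxs in play_map.items():
    lines = [f"({unclean_idx}, {clean_idxs})"]
    lines.extend(play_descriptions.iloc[unclean_idx].split(". "))
    blocks.append("\n".join(lines))
  return "\n\n".join(blocks)

# Toy series holding a single description taken from the output above
toy_descriptions = pd.Series(["(14:01) J.Cook up the middle to BUF 40 for 3 yards (J.Johnson, J.Franklin-Myers)."])
breakdown = format_play_breakdown({0: [0]}, toy_descriptions)
print(breakdown)
```

In the notebook, the call would presumably take a map such as `map_run_plays` and `week1_2023_plays['PlayDescription']`.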

## Passing plays

In [None]:
# Number of passing type plays during 2023, Week 1

df_unclean_pass_plays = week1_2023_plays.loc[week1_2023_plays['PlayOutcome'].str.contains('Pass')]

map_passing_plays = unclean_clean_matches(df_unclean_pass_plays, df_week1_plays_cleaned)

len(map_passing_plays.keys())

997

In [None]:
# Every unclean passing play and their associated cleaned play breakdown

for i in map_passing_plays.keys():
  print(f"({i}, {map_passing_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(1, [1])
(15:00) (Shotgun) J.Allen pass short right to S.Diggs to BUF 32 for 7 yards (A.Gardner).

(2, [2])
(14:34) (No Huddle, Shotgun) J.Allen pass short left to D.Harty to BUF 37 for 5 yards (Qu.Williams).

(5, [5])
(12:39) (Shotgun) J.Allen pass incomplete short left to S.Diggs.

(9, [10])
(8:44) (Shotgun) J.Allen pass short right to D.Harty to BUF 24 for 1 yard (C.Mosley).

(11, [12])
(7:25) (Shotgun) J.Allen pass short left to D.Harty pushed ob at BUF 41 for 3 yards (A.Amos).

(13, [14])
(6:18) (No Huddle, Shotgun) J.Allen pass short left to D.Knox to NYJ 45 for 6 yards (D.Reed).

(14, [15])
(5:44) (No Huddle, Shotgun) J.Allen pass short right to S.Diggs to NYJ 30 for 15 yards (A.Amos).

(16, [17])
(4:30) (Shotgun) J.Allen pass short right to D.Knox to NYJ 35 for 4 yards (Q.Williams, M.Carter).

(17, [18])
(3:56) (No Huddle, Shotgun) J.Allen pass short left to D.Harris to NYJ 22 for 13 yards (Qu.Williams).

(20, [21])
(13:54) (Shotgun) J.Allen pass short left to J.Cook to BUF 31 

In [None]:
# passing type plays during 2023, Week 1 that have been spiked

df_unclean_pass_plays_spiked = week1_2023_plays.loc[(week1_2023_plays['PlayOutcome'].str.contains('Pass')) &
                                                    (week1_2023_plays['PlayDescription'].str.contains('spiked', case=False))]

map_passing_spiked_plays = unclean_clean_matches(df_unclean_pass_plays_spiked, df_week1_plays_cleaned)

for i in map_passing_spiked_plays.keys():
  print(f"({i}, {map_passing_spiked_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(74, [76])
(:17) (No Huddle) J.Allen spiked the ball to stop the clock.

(767, [828])
(:10) (No Huddle) J.Goff spiked the ball to stop the clock.

(1085, [1163])
(:06) (No Huddle) C.Stroud spiked the ball to stop the clock.

(1405, [1500])
(:19) (No Huddle) R.Wilson spiked the ball to stop the clock.

(2395, [2551])
(:24) (No Huddle) K.Pickett spiked the ball to stop the clock.



In [None]:
# passing type plays during 2023, Week 1 that result in touchdown

df_unclean_pass_plays_touchdown = week1_2023_plays.loc[(week1_2023_plays['PlayOutcome'].str.contains('touchdown', case=False)) &
                                                    (week1_2023_plays['PlayDescription'].str.contains('pass', case=False))]

map_passing_touchdown_plays = unclean_clean_matches(df_unclean_pass_plays_touchdown, df_week1_plays_cleaned)

for i in map_passing_touchdown_plays.keys():
  print(f"({i}, {map_passing_touchdown_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(33, [34])
(4:51) (Shotgun) J.Allen pass short right to S.Diggs for 5 yards, TOUCHDOWN.

(134, [142])
(4:58) Z.Wilson pass short left to G.Wilson for 3 yards, TOUCHDOWN.

(163, [171])
(6:14) (Shotgun) J.Love pass short middle to R.Doubs for 8 yards, TOUCHDOWN.

(202, [214])
(6:34) (Shotgun) J.Love pass short middle to A.Jones for 35 yards, TOUCHDOWN
GB-A.Jones was injured during the play
His return is Questionable.

(214, [228])
(13:34) J.Love pass short left to R.Doubs for 4 yards, TOUCHDOWN.

(219, [233])
(12:53) (Shotgun) J.Fields pass short middle intended for D.Mooney INTERCEPTED by Q.Walker [K.Clark] at CHI 37
Q.Walker for 37 yards, TOUCHDOWN
PENALTY on GB-R.Douglas, Unsportsmanlike Conduct, 15 yards, enforced between downs.

(289, [310])
(1:04) (Shotgun) J.Fields pass deep right to D.Mooney for 20 yards, TOUCHDOWN.

(363, [389])
(11:33) (Shotgun) A.Richardson pass short left to M.Pittman for 39 yards, TOUCHDOWN.

(419, [449])
(5:26) (Shotgun) T.Lawrence pass short left to C.Ridl

In [None]:
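# every passing play that resulted in a fumble (including fumble recoveries resulting in a touchdown)
# NOTE: the touchdown clause below also requires 'Pass' in 'PlayOutcome', so it
#       selects nothing beyond the first condition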

df_unclean_pass_fumble_plays = week1_2023_plays.loc[((week1_2023_plays['PlayOutcome'].str.contains('Pass')) |
                                                   ((week1_2023_plays['PlayDescription'].str.contains('Touchdown', case=False)) &
                                                   (week1_2023_plays['PlayOutcome'].str.contains('Pass')))) &
                                                   (week1_2023_plays['PlayDescription'].str.contains('fumbles', case=False))]

for i in unclean_clean_matches(df_unclean_pass_fumble_plays, df_week1_plays_cleaned).items():
  print(i)

(213, [227])
(423, [453])
(872, [937])
(961, [1031])
(1605, [1715])
(1931, [2057])
(2295, [2441])


In [None]:
dict_unclean_to_clean_pass_fumble_plays = unclean_clean_matches(df_unclean_pass_fumble_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_pass_fumble_plays.keys():
  # print(i)
  print(f"({i}, {dict_unclean_to_clean_pass_fumble_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(213, [227])
(14:21) J.Love to CHI 44 for -3 yards
FUMBLES, and recovers at CHI 46
J.Love pass deep left to L.Musgrave to CHI 4 for 37 yards (T.Stevenson) [D.Walker].

(423, [453])
(14:15) T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)
FUMBLES (E.Speed), RECOVERED by IND-E.Speed at IND 49
E.Speed ran ob at IND 49 for no gain
The Replay Official reviewed the ball was inbounds ruling, and the play was REVERSED
T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)
FUMBLES (E.Speed), ball out of bounds at IND 49
IND-K.Moore was injured during the play
IND-D.Flowers was injured during the play.

(872, [937])
(11:26) (Shotgun) D.Prescott pass short right to T.Pollard to NYG 12 for 7 yards (B.Okereke)
FUMBLES (B.Okereke), recovered by DAL-T.Biadasz at NYG 4.

(961, [1031])
(4:45) (Shotgun) D.Jones pass short left to M.Breida to NYG 43 for 5 yards (M.Bell)
FUMBLES (M.Bell), recovered by NYG-P.Campbell at NYG 35
P.Campbell to NYG 

## Intercepted plays

In [None]:
df_unclean_intercepted_plays = week1_2023_plays.loc[(week1_2023_plays['PlayDescription'].str.contains('INTERCEPTED', case=False)) |
                                                    (week1_2023_plays['PlayOutcome'].str.contains('Interception', case=False))]

dict_unclean_to_clean_intercepted_plays = unclean_clean_matches(df_unclean_intercepted_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_intercepted_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_intercepted_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(21, [22])
(13:12) (Shotgun) J.Allen pass deep middle intended for D.Harty INTERCEPTED by J.Whitehead at NYJ 4
J.Whitehead to NYJ 4 for no gain (D.Harty).

(52, [53])
(4:16) (Shotgun) J.Allen pass deep middle intended for S.Diggs INTERCEPTED by J.Whitehead [Q.Williams] at NYJ -1
Touchback.

(64, [66])
(9:49) (Shotgun) J.Allen pass short right intended for G.Davis INTERCEPTED by J.Whitehead at NYJ 43
J.Whitehead ran ob at NYJ 43 for no gain.

(102, [108])
(3:17) Z.Wilson pass short middle intended for R.Cobb INTERCEPTED by M.Milano at NYJ 48
M.Milano to NYJ 35 for 13 yards (Z.Wilson)
PENALTY on BUF-M.Milano, Taunting, 15 yards, enforced at NYJ 35.

(219, [233])
(12:53) (Shotgun) J.Fields pass short middle intended for D.Mooney INTERCEPTED by Q.Walker [K.Clark] at CHI 37
Q.Walker for 37 yards, TOUCHDOWN
PENALTY on GB-R.Douglas, Unsportsmanlike Conduct, 15 yards, enforced between downs.

(388, [417])
(5:09) (Shotgun) A.Richardson pass deep left intended for M.Alie-Cox INTERCEPTED by Ty.Ca

## Punt Plays

In [80]:
df_unclean_punt_plays = week1_2023_plays.loc[week1_2023_plays['PlayDescription'].str.contains('punts', case=False)]

dict_unclean_to_clean_punt_plays = unclean_clean_matches(df_unclean_punt_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_punt_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_punt_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(6, [6])
(12:37) S.Martin punts 46 yards to NYJ 12, Center-R.Ferguson, fair catch by X.Gipson.

(56, [57, 58])
(:42) S.Martin punts 53 yards to NYJ 24, Center-R.Ferguson
X.Gipson pushed ob at NYJ 25 for 1 yard (S.Neal).

(84, [87])
(9:33) T.Morstead punts 31 yards to BUF 23, Center-T.Hennessy, out of bounds.

(93, [97])
(15:00) T.Morstead punts 39 yards to BUF 29, Center-T.Hennessy, fair catch by D.Harty.

(122, [128, 129])
(2:21) T.Morstead punts 50 yards to BUF 11, Center-T.Hennessy
D.Harty ran ob at BUF 15 for 4 yards (J.Sherwood).

(126, [133])
(14:19) T.Morstead punts 54 yards to BUF 15, Center-T.Hennessy, fair catch by D.Harty.

(152, [159, 160])
(9:21) S.Martin punts 42 yards to NYJ 35, Center-R.Ferguson
X.Gipson for 65 yards, TOUCHDOWN.

(169, [177, 178])
(:42) D.Whelan punts 42 yards to CHI 29, Center-M.Orzech
T.Taylor to CHI 37 for 8 yards (I.Gaines).

(174, [184])
(7:26) D.Whelan punts 68 yards to end zone, Center-M.Orzech, Touchback.

(182, [192])
(2:19) D.Whelan punts 42 y

## Fumbles

In [None]:
# All fumbled plays

df_unclean_fumble_plays = week1_2023_plays.loc[week1_2023_plays['PlayDescription'].str.contains('fumbles', case=False)]

dict_unclean_to_clean_fumble_plays = unclean_clean_matches(df_unclean_fumble_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_fumble_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_fumble_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(66, [68])
(4:55) (Shotgun) J.Allen FUMBLES (Aborted) at BUF 21, and recovers at BUF 21
J.Allen to BUF 25 for 4 yards (M.Clemons)
FUMBLES (M.Clemons), RECOVERED by NYJ-Q.Williams at BUF 27.

(115, [121])
(9:54) Bre.Hall left end to BUF 22 for -1 yards (G.Rousseau)
FUMBLES (G.Rousseau), ball out of bounds at BUF 25.

(213, [227])
(14:21) J.Love to CHI 44 for -3 yards
FUMBLES, and recovers at CHI 46
J.Love pass deep left to L.Musgrave to CHI 4 for 37 yards (T.Stevenson) [D.Walker].

(230, [245])
(2:08) S.Clifford FUMBLES (Aborted) at CHI 35, and recovers at CHI 35.

(282, [303])
(5:08) (Shotgun) J.Fields sacked at CHI 18 for 0 yards (sack split by K.Clark and D.Wyatt)
FUMBLES (K.Clark) [D.Wyatt], RECOVERED by GB-R.Douglas at CHI 28
R.Douglas to CHI 28 for no gain (D.Moore)
PENALTY on GB-D.Campbell, Unnecessary Roughness, 15 yards, enforced at CHI 28.

(353, [379])
(4:19) (Shotgun) A.Richardson pass short middle to D.Jackson to JAX 35 for 6 yards (A.Cisco, F.Oluokun)
FUMBLES (A.Cisco), RE

## Kickoffs

In [105]:
# All kickoff plays

df_unclean_kickoff_plays = week1_2023_plays.loc[week1_2023_plays['PlayOutcome'].str.contains('kickoff', case=False)]

dict_unclean_to_clean_kickoff_plays = unclean_clean_matches(df_unclean_kickoff_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_kickoff_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_kickoff_plays.get(i)})")
  play = week1_2023_plays['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(0, [0])
G.Zuerlein kicks 65 yards from NYJ 35 to end zone, Touchback.

(22, [23])
G.Zuerlein kicks 65 yards from NYJ 35 to end zone, Touchback.

(44, [45])
G.Zuerlein kicks 65 yards from NYJ 35 to end zone, Touchback.

(65, [68])
G.Zuerlein kicks 65 yards from NYJ 35 to end zone, Touchback.

(67, [70])
G.Zuerlein kicks 65 yards from NYJ 35 to end zone, Touchback.

(85, [88, 89])
T.Bass kicks 61 yards from BUF 35 to NYJ 4
X.Gipson pushed ob at NYJ 22 for 18 yards (T.Rapp).

(99, [103, 104])
T.Bass kicks 67 yards from BUF 35 to NYJ -2
X.Gipson to NYJ 26 for 28 yards (D.Jackson).

(103, [109])
T.Bass kicks 65 yards from BUF 35 to end zone, Touchback.

(105, [111])
T.Bass kicks 65 yards from BUF 35 to end zone, Touchback.

(145, [152])
T.Bass kicks 65 yards from BUF 35 to end zone, Touchback.

(147, [154])
G.Zuerlein kicks 65 yards from NYJ 35 to end zone, Touchback.

(165, [173])
C.Santos kicks 65 yards from CHI 35 to end zone, Touchback.

(170, [179, 180])
C.Santos kicks 69 yards from C

## Index searching

In [None]:
# week1_2023_plays['PlayDescription'].iloc[34]
week1_2023_plays.iloc[0]

Unnamed: 0,0
Season,2023
Week,Week 1
Day,MON
Date,09/11
AwayTeam,Bills
HomeTeam,Jets
Quarter,1ST QUARTER
DriveNumber,1
TeamWithPossession,BUF
IsScoringDrive,0


In [106]:
df_week1_plays_cleaned.iloc[1365]

Unnamed: 0,1365
Season,2023
Week,Week 1
Day,SUN
Date,09/10
AwayTeam,Raiders
HomeTeam,Broncos
Quarter,1ST QUARTER
DriveNumber,1
TeamWithPossession,LV
IsScoringDrive,1


# cleaned dataset observations

## Home and Away teams (Week 1, 2023)

In [None]:
# Season 2023 Week 1 schedule

df_2023_week1_schedule = df_week1_plays_cleaned[['HomeTeam', 'AwayTeam', 'Season', 'Date', 'Day']].drop_duplicates().sort_values(by='Date').reset_index(drop=True)

df_2023_week1_schedule

Unnamed: 0,HomeTeam,AwayTeam,Season,Date,Day
0,Chiefs,Lions,2023,09/07,THU
1,Bears,Packers,2023,09/10,SUN
2,Colts,Jaguars,2023,09/10,SUN
3,Browns,Bengals,2023,09/10,SUN
4,Giants,Cowboys,2023,09/10,SUN
5,Ravens,Texans,2023,09/10,SUN
6,Saints,Titans,2023,09/10,SUN
7,Broncos,Raiders,2023,09/10,SUN
8,Falcons,Panthers,2023,09/10,SUN
9,Vikings,Buccaneers,2023,09/10,SUN


## Offense Stats

Passing Example
1. Top 10 players who threw the ball the most
2. All passing plays from a specified player
3. Total passing yards from the specified player
4. All receivers who caught a pass from specified player
5. Top target receiver from specified player
6. Top target receiver catching yards

In [None]:
# 1. Top 10 players who threw the ball the most

passers = df_week1_plays_cleaned['Passer'].loc[(df_week1_plays_cleaned['Season'] == 2023) &
                                                (df_week1_plays_cleaned['Week'] == 'Week 1')].value_counts().head(10)

passers

Unnamed: 0_level_0,count
Passer,Unnamed: 1_level_1
,2770


In [None]:
# 2. All passing plays from a specified player

passer = 'C.Stroud'

df_passing_plays_by = df_week1_plays_cleaned.loc[(df_week1_plays_cleaned['Passer'] == passer)].sort_index()

df_passing_plays_by

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,...,SackedBy,ForcedFumbleBy,FumbleDetails,ReverseDetails,InjuredPlayers,AcceptedPenalty,DeclinedPenalty,Kicker,LongSnapper,Returner


In [None]:
df_positive_passing_yards = df_passing_plays_by.loc[df_passing_plays_by['Yardage'] > 0]

# df_positive_passing_yards['Yardage'].sum()
df_positive_passing_yards

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,...,SackedBy,ForcedFumbleBy,FumbleDetails,ReverseDetails,InjuredPlayers,AcceptedPenalty,DeclinedPenalty,Kicker,LongSnapper,Returner


In [None]:
# 3. Total passing yards from the specified player

total_passing_yards = df_passing_plays_by['Yardage'].sum()

total_passing_yards

0.0

In [None]:
# 4. All receivers who caught a pass from specified player

df_all_passing_targets = df_passing_plays_by['Receiver'].loc[(df_passing_plays_by['Receiver'] != 'nan')].value_counts()

df_all_passing_targets

Receiver,count


In [None]:
# 5. Top target receiver from specified player

df_passers_top_target_plays = df_passing_plays_by.loc[df_passing_plays_by['Receiver'] == df_all_passing_targets.head(1).index.tolist()[0]]

df_passers_top_target_plays

IndexError: list index out of range
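
The `IndexError` above is raised because `df_all_passing_targets` is empty, so `.index.tolist()[0]` has nothing to return. A minimal sketch of a guard for that case; the helper `top_index_label` and the toy frame are hypothetical and not part of the notebook:

```python
import pandas as pd

def top_index_label(counts):
    """Return the first index label of a value_counts() result, or None if it is empty."""
    return counts.index[0] if not counts.empty else None

# Toy frame with invented values, standing in for df_passing_plays_by
passing_plays = pd.DataFrame({
    'Receiver': ['X.One', 'Y.Two', 'X.One', 'nan'],
    'Yardage': [12.0, 5.0, 8.0, 0.0],
})

targets = passing_plays.loc[passing_plays['Receiver'] != 'nan', 'Receiver'].value_counts()
top_target = top_index_label(targets)

if top_target is None:
    # No targets found: keep an empty frame with the same columns
    top_target_plays = passing_plays.iloc[0:0]
else:
    top_target_plays = passing_plays.loc[passing_plays['Receiver'] == top_target]

top_target_yards = top_target_plays['Yardage'].sum()

# The same helper on an empty result returns None instead of raising
empty_targets = passing_plays.iloc[0:0]['Receiver'].value_counts()
```

With this guard, the yardage cell that follows would sum an empty frame to zero instead of failing on an undefined variable.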

In [None]:
# 6. Receiving yards for the top target receiver

df_passers_top_target_plays['Yardage'].sum()

Rushing Example
1. All players who carried the ball from a specified team
2. All rushing plays from top rusher of a specified team
3. Total rushing yards from top rusher of a specified team


In [None]:
# 1. All players who carried the ball from a specified team
# - TODO: map team names to their abbreviations
#   - For now, the Cowboys are looked up by hand as 'DAL'

team_abbreviation = 'DAL'

team_rushers = df_week1_plays_cleaned['Rusher'].loc[(df_week1_plays_cleaned['TeamWithPossession'] == team_abbreviation) &
                                                    (df_week1_plays_cleaned['Rusher'] != 'nan')].value_counts()

team_rushers
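
The name-to-abbreviation mapping mentioned in the comment above could be sketched as a plain dictionary lookup. Only the two pairs that appear in this notebook (`'Cowboys'`/`'DAL'` and `'Jets'`/`'NYJ'`) are filled in; the remaining teams would need to be added using whatever abbreviations the play data actually contains. The names `TEAM_ABBREVIATIONS` and `team_abbreviation_for`, and the toy frame, are hypothetical:

```python
import pandas as pd

# Assumption: only these two pairs are taken from the notebook itself;
# the other teams are intentionally left out rather than guessed.
TEAM_ABBREVIATIONS = {
    'Cowboys': 'DAL',
    'Jets': 'NYJ',
}

def team_abbreviation_for(team_name):
    """Look up a team's abbreviation, failing loudly for unmapped teams."""
    try:
        return TEAM_ABBREVIATIONS[team_name]
    except KeyError:
        raise KeyError(f"No abbreviation recorded for team {team_name!r}") from None

# Toy frame with invented rusher labels, standing in for df_week1_plays_cleaned
plays = pd.DataFrame({
    'TeamWithPossession': ['DAL', 'DAL', 'DAL', 'NYJ'],
    'Rusher': ['R.One', 'R.One', 'nan', 'R.Two'],
})

abbreviation = team_abbreviation_for('Cowboys')
is_team_rush = (plays['TeamWithPossession'] == abbreviation) & (plays['Rusher'] != 'nan')
rushers = plays.loc[is_team_rush, 'Rusher'].value_counts()
```

Raising on an unmapped team, rather than returning a default, keeps a misspelled team name from silently producing an empty result.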

In [None]:
# 2. All rushing plays from top rusher of a specified team

df_top_rushers_plays = df_week1_plays_cleaned.loc[df_week1_plays_cleaned['Rusher'] == team_rushers.head(1).index.tolist()[0]]

df_top_rushers_plays

In [None]:
# 3. Total rushing yards from top rusher of a specified team

df_top_rushers_plays['Yardage'].sum()

## Defense Stats

1. All defensive plays from a specified team
2. All solo tackles made from the specified team
3. All plays of the player with the most solo tackles

In [None]:
# 1. All defensive plays from a specified team

team_name = 'Jets'
team_abbreviation = 'NYJ'

df_all_game_plays = df_week1_plays_cleaned.loc[(df_week1_plays_cleaned['HomeTeam'] == team_name) |
                                                (df_week1_plays_cleaned['AwayTeam'] == team_name)]

df_all_defensive_plays = df_all_game_plays.loc[df_all_game_plays['TeamWithPossession'] != team_abbreviation]

df_all_defensive_plays

In [None]:
# 2. All solo tackles made from the specified team

df_all_solo_tackles = df_all_defensive_plays['TackleBy1'].loc[(df_all_defensive_plays['TackleBy1'] != 'nan') &
                                                              (df_all_defensive_plays['TackleBy2'] == 'nan')].value_counts()

df_all_solo_tackles

In [None]:
# 3. All plays of the player with the most solo tackles

df_player_with_most_tackles = df_week1_plays_cleaned.loc[(df_week1_plays_cleaned['TackleBy1'] == df_all_solo_tackles.head(1).index.tolist()[0]) &
                                                         (df_week1_plays_cleaned['TackleBy2'] == 'nan')]

df_player_with_most_tackles
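
The three defensive steps above can be sketched as one reusable function. This assumes, as the cells above do, that a missing tackler is stored as the string `'nan'` and that a solo tackle is a play with `TackleBy1` filled and `TackleBy2` missing. The function name `solo_tackle_counts`, the team labels, and the player labels are all invented for illustration:

```python
import pandas as pd

def solo_tackle_counts(plays, team_name, team_abbreviation):
    """Count solo tackles per player on plays where the given team is on defense."""
    # Plays from games the team took part in
    in_game = (plays['HomeTeam'] == team_name) | (plays['AwayTeam'] == team_name)
    # The team is on defense when it does not have possession
    on_defense = plays['TeamWithPossession'] != team_abbreviation
    # Assumption: the string 'nan' marks a missing tackler
    is_solo = (plays['TackleBy1'] != 'nan') & (plays['TackleBy2'] == 'nan')
    return plays.loc[in_game & on_defense & is_solo, 'TackleBy1'].value_counts()

# Toy frame with made-up teams and players
plays = pd.DataFrame({
    'HomeTeam': ['Hosts', 'Hosts', 'Hosts', 'Others'],
    'AwayTeam': ['Visitors', 'Visitors', 'Visitors', 'Guests'],
    'TeamWithPossession': ['VIS', 'VIS', 'HOS', 'GUE'],
    'TackleBy1': ['A.Def', 'A.Def', 'B.Def', 'C.Def'],
    'TackleBy2': ['nan', 'D.Def', 'nan', 'nan'],
})

counts = solo_tackle_counts(plays, 'Hosts', 'HOS')
```

Restricting the count to the team's own games and defensive snaps in a single mask also avoids picking up a same-named player from another game, which the final cell above does not guard against.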