<a href="https://colab.research.google.com/github/KeoniM/NFL_Data_Cleaning/blob/main/NFL_Plays_Week1_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

PURPOSE:
- Correctly clean a one-week sample of plays
  - Season 2023 -> Week 1

CONCERNS FOR LATER:
- Players with the same name
  - The current goal is to use the fewest possible indicators or features to distinguish between players with the same name.
    - Maybe there is a simpler way?
- Trick plays
  - Laterals
- Cleaning check
  - I need to create some type of check to make sure that these plays are being cleaned correctly.

LATER IDEAS:
- Use 'fuzzywuzzy' to group similar play outcomes so that different play types can be parsed
- Map team names to their abbreviations ( e.g. "Cowboys" <-> "DAL" )
  - With larger datasets spanning multiple weeks, I may be able to map each team name to the abbreviation it co-occurs with most often.

DEFENSE GOALS:
- Grab/decipher/categorize 'Passes Defended'

# MOUNTING AND IMPORTS

In [None]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Used to access personal google cloud services
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
# Imports

# Data manipulation
import pandas as pd

# Regular expressions
import re

# Grab data from database
from google.cloud import bigquery

In [None]:
# # debugger (maybe use in the future)
# %pdb on

# LOADING DATA (BigQuery queries)

In [None]:
# Client connect to bigquery project
client = bigquery.Client('nfl-data-430702')

## Season 2023 Week 1

In [None]:
# Grabbing all plays from Week 1 of the 2023 NFL season
week1_2023_plays_query = """
                         SELECT *
                         FROM `nfl-data-430702.NFL_Scores.NFL-Plays-Week1_2023`
                         """

# Dry run: reports how many bytes the query would process, without actually running it
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query = client.query(week1_2023_plays_query, job_config=dry_run_config)
print("This query will process {} bytes.".format(dry_run_query.total_bytes_processed))

# Running query (being mindful of the amount of data being grabbed)
# The job fails if it would bill more than 1 GB (10**9 bytes)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
safe_config_query = client.query(week1_2023_plays_query, job_config=safe_config)

This query will process 570194 bytes.


In [None]:
# Putting data attained from query into a dataframe
week1_2023_plays = safe_config_query.to_dataframe()

In [None]:
week1_2023_plays.head()

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,PlayNumberInDrive,IsScoringPlay,PlayOutcome,PlayDescription,PlayStart
0,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,1,0,Kickoff,G.Zuerlein kicks 65 yards from NYJ 35 to end z...,Kickoff from NYJ 35
1,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,2,0,7 Yard Pass,(15:00) (Shotgun) J.Allen pass short right to ...,1st & 10 at BUF 25
2,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,3,0,5 Yard Pass,"(14:34) (No Huddle, Shotgun) J.Allen pass shor...",2nd & 3 at BUF 32
3,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,4,0,3 Yard Run,(14:01) J.Cook up the middle to BUF 40 for 3 y...,1st & 10 at BUF 37
4,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,5,0,2 Yard Run,(13:24) (Shotgun) J.Cook up the middle to BUF ...,2nd & 7 at BUF 40


In [None]:
# Noting the original size of the raw uncleaned dataframe of data
# - (rows, columns)
week1_2023_plays.shape

(2600, 15)

# CATEGORIZE PLAYS
- The goal here is to parse out the different values for 'PlayOutcome'
  - This is where I will separate different types of plays
    - ( pass / run / kickoff / etc. )

In [None]:
# Maybe try to fuzzywuzzy this in the future?

# All play outcomes from the game
# - From here we can categorize and clean plays accordingly
week1_2023_plays['PlayOutcome'].unique()

array(['Kickoff', '7 Yard Pass', '5 Yard Pass', '3 Yard Run',
       '2 Yard Run', 'Pass Incomplete', 'Punt', '-5 Yard Penalty',
       '5 Yard Run', '1 Yard Pass', '14 Yard Run', '3 Yard Pass',
       '8 Yard Run', '6 Yard Pass', '15 Yard Pass', '-9 Yard Sack',
       '4 Yard Pass', '13 Yard Pass', 'Field Goal', '-2 Yard Sack',
       'Interception', '-5 Yard Run', '18 Yard Pass', '8 Yard Pass',
       '6 Yard Run', '12 Yard Run', '-1 Yard Run', '26 Yard Pass',
       'Touchdown Bills', 'Extra Point Good', '13 Yard Run',
       '-3 Yard Sack', '7 Yard Run', '9 Yard Pass', '4 Yard Run',
       'Fumble', '-10 Yard Penalty', '10 Yard Pass', '26 Yard Run',
       '5 Yard Penalty', '-10 Yard Sack', '22 Yard Pass', '-4 Yard Run',
       '-12 Yard Sack', '83 Yard Run', '1 Yard Run', '2 Yard Pass',
       '10 Yard Run', 'Run for No Gain', '12 Yard Pass', '20 Yard Pass',
       '9 Yard Run', '-2 Yard Pass', 'Sack', '24 Yard Pass',
       '14 Yard Pass', 'Touchdown Jets', '-3 Yard Run', '-2 Yar

In [None]:
# NOTE:
# There are more play types in Week 1 that I have not handled yet.

# Eyeballing all unique play outcomes in order to categorize them.
# - This approach does not feel very flexible because a play outcome can
#   arise that has not been seen yet.
# - There may be more outcomes when working on a full season, let alone all seasons and future games.

# Play Types Complete
df_2023_pass_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Pass')]
df_2023_run_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Run')]
df_2023_interception_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Interception')]



# Play Types currently working on
df_2023_touchdown_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Touchdown')]
df_2023_sack_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Sack')] # <-- Next



# Play types need to work on
# df_2023_punt_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Punt')]

# df_2023_kickoff_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Kickoff')]
# df_2023_fumble_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Fumble')]
# df_2023_penalty_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Penalty')]
# df_2023_fieldgoal_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Field Goal')]
# df_2023_extrapoint_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Extra Point')]

# plays_list = [df_2023_pass_week1,
#               df_2023_run_week1,
#               df_2023_punt_week1,
#               df_2023_sack_week1,
#               df_2023_kickoff_week1,
#               df_2023_fumble_week1,
#               df_2023_interception_week1,
#               df_2023_penalty_week1,
#               df_2023_fieldgoal_week1,
#               df_2023_touchdown_week1,
#               df_2023_extrapoint_week1]
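
One possible way to address the flexibility concern noted above is to map each 'PlayOutcome' to a single category through an ordered list of patterns, and to send anything unmatched to an 'Other' bucket so that unseen outcomes show up instead of silently dropping out. This is only a sketch: `OUTCOME_RULES`, `categorize_outcome`, the rule order, and the string "Some Unseen Outcome" are assumptions made for illustration, not part of the notebook.

```python
import re
import pandas as pd

# Ordered (category, pattern) pairs; the first matching rule wins.
# The order is an assumption: e.g. 'Touchdown' is checked before 'Pass'.
OUTCOME_RULES = [
    ("Touchdown", r"Touchdown"),
    ("Interception", r"Interception"),
    ("Sack", r"Sack"),
    ("Pass", r"Pass"),
    ("Run", r"Run"),
    ("Punt", r"Punt"),
    ("Kickoff", r"Kickoff"),
    ("Fumble", r"Fumble"),
    ("Penalty", r"Penalty"),
    ("Field Goal", r"Field Goal"),
    ("Extra Point", r"Extra Point"),
]

def categorize_outcome(outcome):
    """Return the first category whose pattern is found in `outcome`, else 'Other'."""
    for category, pattern in OUTCOME_RULES:
        if re.search(pattern, outcome):
            return category
    return "Other"

# A few outcomes seen in Week 1, plus one made-up string to show the fallback
sample_outcomes = pd.Series([
    "7 Yard Pass", "Pass Incomplete", "-9 Yard Sack",
    "Run for No Gain", "Touchdown Bills", "Some Unseen Outcome",
])
sample_categories = sample_outcomes.map(categorize_outcome)
print(sample_categories.value_counts())
```

Because every outcome lands in exactly one category, the size of the 'Other' bucket doubles as a quick signal that a new play outcome needs its own rule.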

In [129]:
for idx, play in df_2023_pass_week1['PlayDescription'].items():
  if play.find('PENALTY') != -1:
  # if play.find('Penalty') != -1:
    print(idx)
    print(play)
    print()

352
(4:43) (Shotgun) A.Richardson pass short middle to A.Ogletree to IND 44 for 20 yards (A.Cisco). PENALTY on JAX-A.Cisco, Unnecessary Roughness, 15 yards, enforced at IND 44.

994
(13:03) (Shotgun) L.Jackson pass incomplete short right [J.Pitre]. PENALTY on BAL-L.Jackson, Intentional Grounding, 14 yards, enforced at HOU 42.

1110
(:04) (No Huddle) C.Stroud pass short right to T.Quitoriano pushed ob at BAL 17 for 11 yards (R.Smith; K.Hamilton). PENALTY on BAL-J.Madubuike, Face Mask, 9 yards, enforced at BAL 17.

1350
(8:14) (Shotgun) J.Garoppolo pass short middle to J.Meyers to DEN 41 for 16 yards (D.Mathis). PENALTY on DEN-E.Bassey, Roughing the Passer, 15 yards, enforced at DEN 41.

1359
(3:00) (Shotgun) J.Garoppolo pass short middle to J.Meyers to DEN 49 for 7 yards (K.Jackson). LV-J.Meyers was injured during the play.  PENALTY on DEN-K.Jackson, Unnecessary Roughness, 15 yards, enforced at DEN 49.

1854
(1:38) (Shotgun) J.Herbert pass incomplete short right. PENALTY on LAC-J.Herber

## SANITY CHECK (All Plays Accounted for)
- NOT COMPLETE
  - Still need to grab other play types
    - Once all plays have been categorized, I will compare the sum of the category sizes to the size of the original dataframe of plays

In [None]:
# Empty for now.
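
A rough sketch of what this check could look like once every play type has its own dataframe. Comparing sums alone can hide a play that is counted twice while another is missed, so this version counts how many category slices each index label appears in. `check_all_plays_accounted` and the toy data are hypothetical and assume the index labels are unique.

```python
import pandas as pd

def check_all_plays_accounted(df_all_plays, category_frames):
    """Return (missing, duplicated) index labels across the category slices.

    missing    - labels of plays that appear in no category
    duplicated - labels of plays that appear in more than one category
    """
    counts = pd.Series(0, index=df_all_plays.index)
    for frame in category_frames.values():
        counts.loc[frame.index] += 1
    missing = counts[counts == 0].index.tolist()
    duplicated = counts[counts > 1].index.tolist()
    return missing, duplicated

# Toy stand-in for the plays dataframe: 'Punt' is deliberately left uncategorized
toy_plays = pd.DataFrame(
    {"PlayOutcome": ["7 Yard Pass", "3 Yard Run", "Punt", "Pass Incomplete"]}
)
toy_frames = {
    "pass": toy_plays[toy_plays["PlayOutcome"].str.contains("Pass")],
    "run": toy_plays[toy_plays["PlayOutcome"].str.contains("Run")],
}
missing, duplicated = check_all_plays_accounted(toy_plays, toy_frames)
print("missing:", missing)
print("duplicated:", duplicated)
```

The check passes when both lists are empty, which also implies that the category sizes sum to the number of rows in the original dataframe.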

# HELPER METHODS (personal use)
- For personal use only; these methods do not take part in cleaning the dataset.

In [None]:
# PURPOSE:
# - Quick look at a section of plays
#   - Ideally the plays that the user wants to break down and clean.
# INPUT PARAMETERS:
# df_all_plays      - DataFrame - The original dataframe where the desired plays to view came from
# df_section_plays  - DataFrame - A section of the original dataframe the user wants to view
# RETURN:
# - Printing to the console:
#   1. index of play
#   2. 'PlayDescription' feature of play
#   3. 'PlayOutcome' feature of play
def print_plays(df_all_plays, df_section_plays):
  for idx, value in df_section_plays['PlayOutcome'].items():
    play = df_all_plays['PlayDescription'].loc[idx] # 'idx' is an index label, so look it up by label rather than by position
    print("index:" + str(idx))
    for i in play.split(". "):
      print(i)
    print(value)
    print()

# PIPELINE
  - ORDER
    1. Regular expressions
      - Used to find common patterns within raw data
    2. Cleaning methods
      - Unique cleaning methods for each play type
        - Some methods may include helper methods
    3. Main pipeline method
      - Control flow of cleaning methods



## 1. REGULAR EXPRESSIONS

In [None]:
####################################################
# REGULAR EXPRESSIONS USED TO LOCATE SPECIFIC DATA #
####################################################

################
# PLAY DETAILS #
################

time_on_clock_pattern = r'\((\d*:\d+)\)'
formation = r'\(([A-Za-z]+ ?[A-Za-z]*,? ?[A-Za-z]*)\)'
yardage_gained = r'for (-?[0-9]+) yards?'

#################
# NAMES OFFENSE #
#################

name_pattern = r"(?:[A-Za-z]+-)*[A-Za-z]+\.[A-Za-z]+(?:-[A-Za-z]+)*" # raw string so '\.' is not treated as a string escape
passer_name_pattern = f"({name_pattern}) pass" # All passers are exclusively followed by ' pass'
sacked_passer_name_pattern = f"({name_pattern}) sacked" # All sacked passers are exclusively followed by ' sacked'
receiver_name_pattern = f"to ({name_pattern})" # All receivers exclusively follow 'to '
intended_receiver_name_pattern = f"intended for ({name_pattern})" # intended receiver on an intercepted play
rusher_pattern = f"({name_pattern})(?: scrambles)? (?:left|right|up|kneels|Aborted|FUMBLES).?"

#################
# NAMES DEFENSE #
#################

defense_tackler_1_name_pattern = rf"\(({name_pattern})" # Will have a "(" in front of the name
defense_tackler_2_name_pattern = rf" ({name_pattern})\)" # Will have a ")" at the end of the name
# MIGHT NEED TO CHANGE:
# - I think it might be possible for multiple defenders to apply pressure to the passer.
defense_pressure_name_pattern = rf"\[({name_pattern})\]" # Surrounded by "[]" brackets
interception_name_pattern = f"INTERCEPTED by ({name_pattern})"
split_sack_pattern = f"sack split by ({name_pattern}) and ({name_pattern})"

#######################
# PATTERNS ON FUMBLES #
#######################

# Passer fumbles are always the initial action on the play; the clock time
# (and possibly the formation) is displayed before the action.
qb_fumble = f" ({name_pattern}) to(?: [A-Z]+) [0-9]+ for -?[0-9]+ yards$"
# Yardage after recovery (formatted almost exactly like a regular run play)
run_after_recovery = f"^({name_pattern}) to(?: [A-Z]+) [0-9]+ for "
run_after_interception = f"({name_pattern}) (?:pushed ob at|ran ob at|to)(?: [A-Z]+)? [0-9]+ for " # yardage after interception
touchdown_after_interception = f"({name_pattern}) for [0-9]+ yards, TOUCHDOWN" # touchdown after interception

##############
#  INJURIES  #
##############

injury = f"[A-Z]+-({name_pattern}) was injured during the play" # Returns the player(s) who got injured during the play
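
To make the patterns more concrete, here is a small illustration that applies a few of them to one of the Week 1 descriptions printed earlier. The patterns are restated (as raw strings) so the example runs on its own; the helper `first_match` is hypothetical and not part of the pipeline.

```python
import re

# Patterns restated from this section so the example is self-contained
time_on_clock_pattern = r'\((\d*:\d+)\)'
formation = r'\(([A-Za-z]+ ?[A-Za-z]*,? ?[A-Za-z]*)\)'
yardage_gained = r'for (-?[0-9]+) yards?'
name_pattern = r"(?:[A-Za-z]+-)*[A-Za-z]+\.[A-Za-z]+(?:-[A-Za-z]+)*"
passer_name_pattern = f"({name_pattern}) pass"
receiver_name_pattern = f"to ({name_pattern})"
defense_tackler_1_name_pattern = rf"\(({name_pattern})"

# A Week 1 description taken from the penalty listing earlier in the notebook
play = ("(8:14) (Shotgun) J.Garoppolo pass short middle to J.Meyers to DEN 41 "
        "for 16 yards (D.Mathis). PENALTY on DEN-E.Bassey, Roughing the Passer, "
        "15 yards, enforced at DEN 41.")

def first_match(pattern, text):
    """Hypothetical helper: first captured group found in `text`, or None."""
    found = re.findall(pattern, text)
    return found[0] if found else None

extracted = {
    "TimeOnTheClock": first_match(time_on_clock_pattern, play),
    "Formation": first_match(formation, play),
    "Yardage": first_match(yardage_gained, play),
    "Passer": first_match(passer_name_pattern, play),
    "Receiver": first_match(receiver_name_pattern, play),
    "TackleBy1": first_match(defense_tackler_1_name_pattern, play),
}
for feature, value in extracted.items():
    print(feature, "->", value)
```

Note that the yardage comes back as a string and that the 15 penalty yards are not picked up, because `yardage_gained` only matches yardage introduced by "for".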

## 2. CLEANING METHODS

### pass helper method (Fumbles)


In [None]:
# FOR PASSING ONLY RIGHT NOW
# - A possible goal down the road is to create a single method that can handle
#   all fumble situations, whether it be a running fumble or a passing fumble.

# PURPOSE:
# - Extract fumble data from fumbled plays.
#   - The goal is to strictly grab data that can only appear during fumbled plays,
#     while pushing all commonly formatted play type data to the main cleaning methods.

# NOTE:
# - It is common for a single fumbled play row to be divided into multiple rows.
#   - For example, an intended play has been fumbled and a player recovers the fumble for a touchdown.
#     - This will be split into 2 separate rows, (1) the intended play row and (2) the fumble recovery row.
#   - The concern here is making sure those rows within the main dataframe of
#     plays are tied together in some way, to signify that the multiple rows
#     are not different plays but all instances of the same play.
#     - A solution here could be the features the multi play rows share.
#       - For example, (TimeOnTheClock, Week, Quarter, DriveNumber, PlayNumberInDrive, etc.)

#####################################################
# ROUGH DESIGN OF SINGLE ROW PLAY -> MULTI ROW PLAY #
#####################################################

# - SINGLE PLAY ROW TO SINGLE PLAY ROW(S) METHOD:
#   1. Split play into appropriate divisions (e.g. 1 row -> 3 rows)
#      a. (row 1) - passer fumble
#      b. (row 2) - passing play
#      c. (row 3) - recovery for yardage
#      NOTE:
#      - These are all instances that call for a split
#      - This will always be the chronological order
#        - Any row out of these can be missing depending on the play.
#   2. Clean each row individually
#      1. Transform data into individual single row dataframes
#      2. Run each row through appropriate cleaning method (e.g. passing, running, ...)
#   3. Organize rows chronologically
#      1. Create single dataframe containing all individual rows

# - REPLACING PLAY WITHIN MAIN DATAFRAME:
#   1. return single play multi row dataframe(?)
#      -> MAIN CLEANING METHOD:
#         1. replace original play row with new single play multi rows
#            1. identify index of original play
#            2. break main dataframe in 2 pieces
#               a. Dataframe 1 - dataframe before index (exclusive)
#               b. Dataframe 2 - dataframe after index (exclusive)
#            3. concat new dataframe (Dataframe 1 +
#                                     single play multi row dataframe +
#                                     Dataframe 2)
#         2. rerun main cleaning method (recursion)
#            - manually insert index after last added row to pick up where it left off
#            - exit case will be when the last passing type play has been cleaned

##########################
# EXAMPLE PLAY BREAKDOWN #
##########################

# PLAY (WITH NOTES):
# (14:21) J.Love to CHI 44 for -3 yards <- signal for an additional row needed (passer fumble: grabbing passer name and yardage)
# FUMBLES, and recovers at CHI 46 <- added to play feature 'FumbleDetails'
# J.Love pass deep left to L.Musgrave to CHI 4 for 37 yards (T.Stevenson) [D.Walker]. <- pass to main breakdown method (follows traditional passing play format)
# NOTE:
# - If the fumble were recovered and run for yardage, that would also call for an additional row.
# EXAMPLE:
# (4:45) (Shotgun) D.Jones pass short left to M.Breida to NYG 43 for 5 yards (M.Bell)
# FUMBLES (M.Bell), recovered by NYG-P.Campbell at NYG 35
# P.Campbell to NYG 33 for -2 yards <- signal for an additional row needed (fumble recovery for yards: grabbing player who recovered and yardage)
# Officially, a pass for -3 yards.

def extract_fumble_data_pass(df_plays, play, play_index):

  # Separating each sentence within 'PlayDescription' (each sentence represents a single action)
  play_elements = play.split(". ")
  # Collecting fumble data in the exact order in which it happened.
  extracted_fumble_details = [None] * len(play_elements)
  back_to_main_cleaning_method = []

  # list for plays that need multiple rows
  multi_row_play = []
  # lists to collect distinct actions that will become their own rows
  passer_fumble = []
  fumble_recovery = []

  for i in play_elements:
    # Assume everything is going back to main cleaning method
    back_to_main_cleaning_method.append(i)

    # Passer fumble
    # 1. Isolate the passer fumble action. (Take out of list going back to main cleaning method)
    # 2. create new row (dataframe) with passer fumble action
    # 3. clean newly created row (dataframe)
    #    - QUESTION: Should 'PlayType' remain as 'pass' or should it be something else..?
    #      - For now it will be 'run'.
    # 4. append newly created row to 'passer_fumble'
    #    - will be a list of single row dataframes (only expecting this list to have 1 element)
    #    - POTENTIAL ERROR:
    #      - qb_fumble I believe will not pick up REVERSED plays that initially start with a qb fumble.
    passer_fumble_action = re.findall(qb_fumble, i)
    if len(passer_fumble_action) > 0:
      # 1. Isolate the passer fumble action. (Take out of list going back to main cleaning method)
      back_to_main_cleaning_method.pop(back_to_main_cleaning_method.index(i))
      # 2. create new row (dataframe) with passer fumble action
      passer_fumble_row = df_plays.iloc[play_index].copy()
      passer_fumble_row['PlayDescription'] = i
      passer_fumble_row = pd.DataFrame([passer_fumble_row], columns=df_plays.columns)
      # 3. clean newly created row (dataframe)
      # Temporarily set 'PlayOutcome' to 'Run'. This is ugly, but without it the
      # cleaning method for run plays will not clean the row.
      passer_fumble_row['PlayOutcome'] = 'Run'
      cleaned_passer_fumble_row = clean_run_plays(passer_fumble_row)
      # Also ugly: switch 'PlayOutcome' back to the value it shares with the
      # rest of the grouped rows representing the play.
      cleaned_passer_fumble_row['PlayOutcome'] = df_plays.at[play_index, 'PlayOutcome']
      # 4. append newly created row to 'passer_fumble'
      passer_fumble.append(cleaned_passer_fumble_row)

    # Fumble sentences to (fumble details)
    if i.find('FUMBLES') != -1:
      back_to_main_cleaning_method.pop(back_to_main_cleaning_method.index(i))
      extracted_fumble_details.pop(play_elements.index(i))
      extracted_fumble_details.insert(play_elements.index(i), i)

    # Recovery for yardage
    # 1. Isolate the recovery for yardage action
    # 2. create new row (dataframe) with recovery for yardage action
    # 3. clean newly created row (dataframe)
    # 4. append newly created row to 'fumble_recovery'
    fumble_recovery_action = re.findall(run_after_recovery, i)
    if len(fumble_recovery_action) > 0:
      # 1. Isolate the recovery for yardage action
      back_to_main_cleaning_method.pop(back_to_main_cleaning_method.index(i))
      # 2. create new row (dataframe) with recovery for yardage action
      fumble_recovery_row = df_plays.iloc[play_index].copy()
      fumble_recovery_row['PlayDescription'] = i
      fumble_recovery_row = pd.DataFrame([fumble_recovery_row], columns=df_plays.columns)
      # 3. clean newly created row (dataframe)
      # Temporarily set 'PlayOutcome' to 'Run'. This is ugly, but without it
      # 'clean_run_plays' will not clean the row.
      fumble_recovery_row['PlayOutcome'] = 'Run'
      cleaned_fumble_recovery_row = clean_run_plays(fumble_recovery_row)
      # Also ugly: switch 'PlayOutcome' back to the value it shares with the
      # rest of the grouped rows representing the play.
      cleaned_fumble_recovery_row['PlayOutcome'] = df_plays.at[play_index, 'PlayOutcome']
      # 4. append newly created row to 'fumble_recovery'
      fumble_recovery.append(cleaned_fumble_recovery_row)

  ##################################################
  # COMBINING ROWS FOR PLAYS THAT REQUIRE MULTIPLE #
  ##################################################

  # Check to see if additional rows are needed (i.e. if there are any elements within these 2 lists)
  if len(passer_fumble) + len(fumble_recovery) > 0:
    # Creating and cleaning row for intended play
    # - Cleaning all data that was going to be sent back to the main cleaning method
    main_play_row = df_plays.iloc[play_index].copy()
    main_play_row['PlayDescription'] = '. '.join(back_to_main_cleaning_method)
    main_play_row = pd.DataFrame([main_play_row], columns=df_plays.columns)
    cleaned_main_play_row = clean_pass_plays(main_play_row)
    # Organize rows chronologically
    # 1. (row 1) - passer fumble
    # 2. (row 2) - passing play
    # 3. (row 3) - recovery for yardage
    multi_row_play.extend(passer_fumble)
    multi_row_play.append(cleaned_main_play_row)
    multi_row_play.extend(fumble_recovery)
    # Creating dataframe to group the divided single play rows
    df_split_single_play = pd.DataFrame(columns=df_plays.columns)
    # Iterate through each row and add to dataframe
    for i in multi_row_play:
      # Add the single play's 'FumbleDetails' to each row
      if len(extracted_fumble_details) > 0:
        # 'multi_row_play' is a list full of single row dataframes.
        # - This means that there is only one index for every dataframe within 'multi_row_play'
        row_index = i.index[0]
        i.at[row_index, 'FumbleDetails'] = extracted_fumble_details
      # Combining each row, all pieces of a single play, into a dataframe
      if df_split_single_play.empty:
        df_split_single_play = i # pandas is deprecating concatenation with an empty dataframe, so start from the first row instead.
      else:
        df_split_single_play = pd.concat([df_split_single_play, i], ignore_index=True)
    return None, None, df_split_single_play

  # returning empty dataframe because there will be zero additional rows added
  return extracted_fumble_details, back_to_main_cleaning_method, pd.DataFrame()
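
The "REPLACING PLAY WITHIN MAIN DATAFRAME" steps described in the comments above can be sketched in isolation as follows. This is a minimal illustration on toy data, assuming a default integer index so that position and label coincide; `replace_row_with_rows` is a hypothetical name and is not used by the cleaning methods.

```python
import pandas as pd

def replace_row_with_rows(df, position, new_rows):
    """Replace the row at integer `position` with every row of `new_rows`."""
    before = df.iloc[:position]        # rows before the original play (exclusive)
    after = df.iloc[position + 1:]     # rows after the original play (exclusive)
    # ignore_index=True renumbers the result 0..n-1, so later positional
    # lookups stay valid after the extra rows have been inserted
    return pd.concat([before, new_rows, after], ignore_index=True)

# Toy plays: the second play is split into two rows
toy_plays = pd.DataFrame({
    "PlayNumberInDrive": [1, 2, 3],
    "PlayDescription": ["a", "b", "c"],
})
split_rows = pd.DataFrame({
    "PlayNumberInDrive": [2, 2],
    "PlayDescription": ["b1", "b2"],
})
result = replace_row_with_rows(toy_plays, 1, split_rows)
print(result)

# Rows that came from the same play still share 'PlayNumberInDrive',
# which is one way to tie the split rows back together later.
rows_per_play = result.groupby("PlayNumberInDrive").size()
print(rows_per_play)
```

After the replacement, the index of the last added row is `position + len(new_rows) - 1`, which matches the restart index the main cleaning method computes.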

### PASS PLAYS

In [131]:
# PURPOSE:
# - Clean all passing type plays within a given dataframe.
# INPUT PARAMETERS:
# df_plays    - dataframe - NFL plays (can include play types other than passing)
# index_start -  integer  - (optional) index label at which the method starts cleaning;
#                           rows before it are skipped.
# RETURN:
# df_plays - dataframe - the same df_plays input but with all passing play types cleaned

def clean_pass_plays(df_plays, index_start = None):

  # Restrict cleaning to the rows at and after 'index_start' (an index label), if given
  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
  else:
    df_plays_adjusted = df_plays
  # Locating all passing type plays within dataframe
  df_pass_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Pass')]

  for idx, play in df_pass_plays['PlayDescription'].items():

    ################
    # Play details #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Pass'

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ###########
    # FUMBLES #
    ###########

    # Additional rows may be added after certain types of fumbled passing plays.
    # - The idea here is that, in those situations, the helper method 'extract_fumble_data_pass'
    #   will return a small dataframe of the rows that the single play was split into.
    #   - When this small dataframe is returned, it will need to replace the original play
    #     within the main dataframe of plays and then continue on cleaning the rest of the passing plays.

    if play.find('FUMBLES') != -1:
      fumble_details, play, df_added_rows = extract_fumble_data_pass(df_plays, play, idx)
      if not df_added_rows.empty:
        df_before = df_plays.iloc[:idx]
        df_after = df_plays.iloc[idx+1:]
        df_plays = pd.concat([df_before, df_added_rows, df_after], ignore_index=True)
        index_of_last_added_row = idx + len(df_added_rows) - 1
        return clean_pass_plays(df_plays, index_of_last_added_row)

      df_plays.at[idx, 'FumbleDetails'] = fumble_details
      play = ". ".join(play)

    ###########
    # OFFENSE #
    ###########

    # NOTE:
    # - Incomplete passes will have 'PlayOutcome' as 'Pass Incomplete' as well
    #   as yardage value being 0.0

    # Yardage gained
    yardage = re.findall(yardage_gained, play)
    if len(yardage) > 0:
      df_plays.loc[idx, 'Yardage'] = int(yardage[0])
    else:
      df_plays.loc[idx, 'Yardage'] = 0

    # Formation
    Formation = re.findall(formation, play)
    # '(Aborted)' matches the formation pattern but is not a formation, so skip it
    if len(Formation) > 0 and Formation[0] != 'Aborted':
      df_plays.loc[idx, 'Formation'] = Formation[0]

    # Passer (What about spikes?)
    passer_name = re.findall(passer_name_pattern, play)
    if len(passer_name) > 0:
      df_plays.loc[idx, 'Passer'] = passer_name[0]

    # Pass Type
    if play.find('deep') != -1:
      df_plays.loc[idx, 'PassType'] = 'Deep'
    elif play.find('short') != -1:
      df_plays.loc[idx, 'PassType'] = 'Short'

    # Pass Direction
    if play.find('left') != -1:
      df_plays.loc[idx, 'Direction'] = 'Left'
    elif play.find('right') != -1:
      df_plays.loc[idx, 'Direction'] = 'Right'
    elif play.find('middle') != -1:
      df_plays.loc[idx, 'Direction'] = 'Middle'

    # Unique situation (offense spikes the ball)
    if play.find('spike') != -1:
      df_plays.loc[idx, 'PassType'] = 'Spike'
      df_plays.loc[idx, 'Passer'] = re.findall(name_pattern, play)[0]

    # Receiver
    receiver_names = re.findall(receiver_name_pattern, play)
    if len(receiver_names) > 0:
      df_plays.loc[idx, 'Receiver'] = receiver_names[0]

    #############
    #  DEFENSE  #
    #############

    # Difference between ", " and "; " separating tacklers
    # ', ' - both defenders worked together to make the tackle
    # "; " - first defender initiated hit and second finished
    # - Should I mark the differences?

    tackler_1 = re.findall(defense_tackler_1_name_pattern, play) # tackler #1 (Could be solo or the one who initiated the hit)
    if len(tackler_1) > 0:
      df_plays.loc[idx, 'TackleBy1'] = tackler_1[0]

    tackler_2 = re.findall(defense_tackler_2_name_pattern, play) # tackler #2 (equally contributed or assisted with tackle)
    if len(tackler_2) > 0:
      df_plays.loc[idx, 'TackleBy2'] = tackler_2[0]

    pressure_by = re.findall(defense_pressure_name_pattern, play) # defender who applied pressure to the passer
    if len(pressure_by) > 0:
      df_plays.loc[idx, 'PressureBy'] = pressure_by[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:
          penalties.append(i)
      df_plays.at[idx, 'AcceptedPenalty'] = penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          penalties.append(i)
      df_plays.at[idx, 'DeclinedPenalty'] = penalties

  # All passing plays have been cleaned (this also covers a dataframe with no passing plays)
  return df_plays
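
The penalty handling at the end of 'clean_pass_plays' relies on a convention in the raw text: accepted penalties are written as 'PENALTY' and declined ones as 'Penalty'. The same split can be sketched as a standalone function, shown here on one of the Week 1 descriptions printed earlier. `split_penalties` is a hypothetical name, and the sketch assumes that splitting on ". " separates the actions cleanly.

```python
def split_penalties(play):
    """Split a play description into (accepted, declined) penalty sentences."""
    sentences = [sentence.strip() for sentence in play.split(". ")]
    accepted = [s for s in sentences if "PENALTY" in s]   # upper case -> accepted
    declined = [s for s in sentences if "Penalty" in s]   # capitalized -> declined
    return accepted, declined

# A Week 1 description taken from the penalty listing earlier in the notebook
play = ("(4:43) (Shotgun) A.Richardson pass short middle to A.Ogletree to IND 44 "
        "for 20 yards (A.Cisco). PENALTY on JAX-A.Cisco, Unnecessary Roughness, "
        "15 yards, enforced at IND 44.")

accepted, declined = split_penalties(play)
print("accepted:", accepted)
print("declined:", declined)
```

Returning both lists from one function keeps the two cases side by side, which may make it easier to reuse the logic in the run cleaning method.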

### run helper method (Fumbles)
- Goal might be to combine both pass and run helper methods for fumbles

In [None]:
# PURPOSE:
# - Extract fumble details and push back data from fumbled plays that can be broken
#   down by the main play cleaning method.
# INPUT PARAMETERS:
# df_plays   - dataframe - dataframe of plays
# play       -  string   - 'PlayDescription' of play that contains a fumble
# play_index -  integer  - index of the fumbled play within 'df_plays'
# RETURN (TUPLE):
# extracted_fumble_details     -   list    - all details of the fumbled play that contain data
#                                            that is of less importance
#                                             - The reason for this is to save space. It does not
#                                               make sense to have features for this data when
#                                               1/100 plays will contain a fumble.
# back_to_main_cleaning_method -   list    - All details of the fumbled play that can be broken
#                                            down by the main play cleaning method.
# df_split_single_play         - dataframe - When a single play needs to be split into separate rows,
#                                            this will return a dataframe of that single play into split
#                                            rows.

def extract_fumble_data_run(df_plays, play, play_index):

  # 'PlayDescription' is made up of a group of sentences, each containing individual actions of the play.
  play_elements = play.split(". ")
  extracted_fumble_details = [None] * len(play_elements)
  back_to_main_cleaning_method = []

  # list for plays that need multiple rows
  multi_row_play = []
  # To collect distinct actions that will become their own rows
  df_fumble_recovery = []

  for i in play_elements:
    # Assuming everything is going back to main cleaning method
    back_to_main_cleaning_method.append(i)

    # Aborted sentences to both (fumble details & main cleaning method)
    if i.find('Aborted') != -1:
      extracted_fumble_details.pop(play_elements.index(i))
      extracted_fumble_details.insert(play_elements.index(i), i)
      continue

    # Fumble sentences to (fumble details)
    if i.find('FUMBLES') != -1:
      back_to_main_cleaning_method.pop(back_to_main_cleaning_method.index(i))
      extracted_fumble_details.pop(play_elements.index(i))
      extracted_fumble_details.insert(play_elements.index(i), i)

    # Recovery for yardage
    # 1. Isolate the recovery for yardage action (Take out of list going back to main cleaning method)
    # 2. create new row (dataframe) with recovery for yardage action
    # 3. clean newly created row (dataframe)
    # 4. append newly created row to 'df_fumble_recovery'
    fumble_recovery_action = re.findall(run_after_recovery, i)
    if len(fumble_recovery_action) > 0:
      # 1. Isolate the recovery for yardage action (Take out of list going back to main cleaning method)
      back_to_main_cleaning_method.pop(back_to_main_cleaning_method.index(i))
      # 2. create new row (dataframe) with recovery for yardage action
      recovery_for_yardage_row = df_plays.iloc[play_index].copy()
      recovery_for_yardage_row['PlayDescription'] = i
      recovery_for_yardage_row = pd.DataFrame([recovery_for_yardage_row], columns=df_plays.columns)
      # 3. clean newly created row (dataframe)
      #    - Will clean without a problem because 'PlayOutcome' has 'Run' in its value.
      cleaned_recovery_for_yardage_row = clean_run_plays(recovery_for_yardage_row)
      # 4. append newly created row to 'df_fumble_recovery'
      df_fumble_recovery.append(cleaned_recovery_for_yardage_row)

  ##################################################
  # COMBINING ROWS FOR PLAYS THAT REQUIRE MULTIPLE #
  ##################################################

  # Check to see if additional rows are needed (i.e. if there are any elements within the lists)
  if len(df_fumble_recovery) > 0:
    # Creating and cleaning row for intended play
    main_play_row = df_plays.iloc[play_index].copy()
    main_play_row['PlayDescription'] = '. '.join(back_to_main_cleaning_method)
    main_play_row = pd.DataFrame([main_play_row], columns=df_plays.columns)
    cleaned_main_play_row = clean_run_plays(main_play_row)

    # Organize rows chronologically
    # 1. (row 1) - running play
    # 2. (row 2) - recovery for yardage
    multi_row_play.append(main_play_row)
    multi_row_play.extend(df_fumble_recovery)
    # Creating dataframe to group the divided single play rows
    df_split_single_play = pd.DataFrame(columns=df_plays.columns)
    # Iterate through each row and add to dataframe
    for i in multi_row_play:
      if len(extracted_fumble_details) > 0:
        # 'multi_row_play' is a list full of single row dataframes.
        # - This means that there is only one index for every dataframe within 'multi_row_play'
        row_index = i.index[0]
        i.at[row_index, 'FumbleDetails'] = extracted_fumble_details
      # Combining each row, all pieces of a single play, into a dataframe
      if df_split_single_play.empty:
        df_split_single_play = i # Pandas is deprecating the ability to concat an empty dataframe with one that is not.
      else:
        df_split_single_play = pd.concat([df_split_single_play, i], ignore_index=True)

    return None, None, df_split_single_play

  # returning empty dataframe because there will be zero additional rows added
  return extracted_fumble_details, back_to_main_cleaning_method, pd.DataFrame()
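
The sentence routing above can be sketched in isolation. This is a minimal sketch, not the notebook's implementation: `RECOVERY_RUN` is a hypothetical stand-in for `run_after_recovery` (the real pattern is defined elsewhere in the notebook), and `route_fumble_sentences` is an illustrative name. The sample text is the start of the sack/fumble play printed in the testing area, used here only because it contains a recovery run.

```python
import re

# Hypothetical stand-in for 'run_after_recovery': a sentence that starts with a
# player name followed by "to <TEAM> <yardline>" or "for <n> yards".
RECOVERY_RUN = re.compile(r"^([A-Z][\w'-]*\.[A-Z][\w'-]+) (?:to [A-Z]{2,3} \d+|for \d+ yards)")

def route_fumble_sentences(play):
    """Split a play into (fumble_details, main_play, recovery_runs) sentence lists."""
    sentences = play.split(". ")
    fumble_details = [None] * len(sentences)  # keeps each sentence's original position
    main_play, recovery_runs = [], []
    for pos, sentence in enumerate(sentences):
        if "FUMBLES" in sentence:
            fumble_details[pos] = sentence
        elif RECOVERY_RUN.match(sentence):
            recovery_runs.append(sentence)
        else:
            main_play.append(sentence)
    return fumble_details, main_play, recovery_runs

play = ("(2:41) (Shotgun) T.Lawrence sacked at JAX 28 for -8 yards (D.Buckner). "
        "FUMBLES (D.Buckner) [D.Buckner], recovered by JAX-T.Bigsby at JAX 35. "
        "T.Bigsby to JAX 35 for no gain (Z.Franklin)")
details, main_play, recovery_runs = route_fumble_sentences(play)
print(details)
print(main_play)
print(recovery_runs)
```

Iterating with `enumerate` avoids looking a sentence up by value, so two identical sentences in one play would still land in their own positions.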

### RUN PLAYS

In [None]:
# PURPOSE:
# - Clean run type plays
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning. (Needs to be the index of a run play)
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful run play
#                        data accessible and clean.

def clean_run_plays(df_plays, index_start = None):

  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_run_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Run')]
  else:
    df_run_plays = df_plays[df_plays['PlayOutcome'].str.contains('Run')]

  # Iterating through every run play within 'df_run_plays'
  for idx, play in df_run_plays['PlayDescription'].items():

    ################
    # Play details #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Run'

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ###########
    # FUMBLES #
    ###########

    if play.find('FUMBLES') != -1:
      fumble_details, play, df_added_rows = extract_fumble_data_run(df_plays, play, idx)
      if not df_added_rows.empty:
        df_before = df_plays.iloc[:idx]
        df_after = df_plays.iloc[idx+1:]
        df_plays = pd.concat([df_before, df_added_rows, df_after], ignore_index=True)
        index_of_last_added_row = idx + len(df_added_rows) - 1
        return clean_run_plays(df_plays, index_of_last_added_row)

      df_plays.at[idx, 'FumbleDetails'] = fumble_details
      play = ". ".join(play)

    # Yardage gained
    yardage = re.findall(yardage_gained, play)
    if len(yardage) > 0:
      df_plays.loc[idx, 'Yardage'] = int(yardage[0])
    else:
      df_plays.loc[idx, 'Yardage'] = 0

    #############
    #  OFFENSE  #
    #############

    # Formation
    Formation = re.findall(formation, play)
    if len(Formation) > 0:
      if Formation[0] == 'Aborted':
        pass
      else:
        df_plays.loc[idx, 'Formation'] = Formation[0]

    # Rusher
    rusher_patterns = [rusher_pattern, run_after_recovery, qb_fumble]
    # Reset for every play so a name from a previous play is never reused
    rusher_name = None
    # Loop through patterns and find the first match
    for pattern in rusher_patterns:
      rusher = re.findall(pattern, play)
      if len(rusher) > 0:
        rusher_name = rusher[0]
        break
    if rusher_name is not None:
      df_plays.loc[idx, 'Rusher'] = rusher_name

    # Direction (text between the rusher's name and the direction keyword)
    rushing_directions = ['guard', 'middle', 'tackle', 'end', 'kneels']
    if rusher_name is not None:
      for i in rushing_directions:
        if play.find(i) != -1:
          start = play.find(rusher_name) + len(rusher_name) + 1
          end = play.find(i) + len(i)
          df_plays.loc[idx, 'Direction'] = play[start:end]

    #############
    #  DEFENSE  #
    #############

    tackler_1 = re.findall(defense_tackler_1_name_pattern, play) # tackler #1 (Could be solo or the one who initiated the hit)
    if len(tackler_1) > 0:
      df_plays.loc[idx, 'TackleBy1'] = tackler_1[0]
    tackler_2 = re.findall(defense_tackler_2_name_pattern, play) # tackler #2 (equally contributed or assisted with tackle)
    if len(tackler_2) > 0:
      df_plays.loc[idx, 'TackleBy2'] = tackler_2[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:
          penalties.append(i)
      df_plays.at[idx, 'AcceptedPenalty'] = penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          penalties.append(i)
      df_plays.at[idx, 'DeclinedPenalty'] = penalties

  # Every run play in 'df_run_plays' has been cleaned
  # (this also covers a dataframe that contains no run plays)
  return df_plays
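
The rusher and direction lookup can be illustrated on its own. This is a minimal sketch under stated assumptions: `RUSHER` is a hypothetical stand-in for `rusher_pattern`, and `rusher_and_direction` is an illustrative helper, not a function from this notebook. It returns `None` for both values when no rusher is found, so the caller can leave the columns untouched.

```python
import re

# Hypothetical stand-in for 'rusher_pattern': the name that follows the optional
# "(clock)" and "(formation)" prefixes at the start of the description.
RUSHER = re.compile(r"^(?:\([^)]*\) )*([A-Z][\w'-]*\.[A-Z][\w'-]+)")
DIRECTIONS = ("guard", "middle", "tackle", "end", "kneels")

def rusher_and_direction(play):
    """Return (rusher, direction); either value is None when it cannot be found."""
    match = RUSHER.match(play)
    if match is None:
        return None, None
    rusher = match.group(1)
    rest = play[match.end():].lstrip()
    for keyword in DIRECTIONS:
        pos = rest.find(keyword)
        if pos != -1:
            # Direction is everything from the end of the name through the keyword
            return rusher, rest[:pos + len(keyword)]
    return rusher, None

run_play = "(9:54) Bre.Hall left end to BUF 22 for -1 yards (G.Rousseau)"
print(rusher_and_direction(run_play))
print(rusher_and_direction("Timeout #1 by NYJ at 02:00."))
```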

### INTERCEPTIONS

In [134]:
# PURPOSE:
# - Clean intercepted plays
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful intercepted play
#                        data accessible and clean.

# ROUGH DESIGN
# 1. Narrow dataframe using 'index_start'
#    - This is a recursive method, the narrowing will get smaller and
#      smaller until all 'intercepted' type plays have been cleaned.
# 2. Grab first 'intercepted' play from narrowed dataframe
# 3. Create 2 single row dataframes.
#    a. intended play
#    b. yardage after interception
# 4. Break down play into sentences and clean
#    - Depending on the sentence within the play, will determine which
#      single row dataframe it will go to.
# 5. Combine both dataframes of cleaned data into one dataframe
# 6. Replace old play row with new cleaned multi row
# 7. return clean_intercepted_plays( x , y)
#    - x = updated df_plays
#    - y = index directly after the last clean added row

# Concerns:
# ~ 1 ~
# PLAY SNIP - "(9:53) (Shotgun) D.Watson pass short left intended for E.Moore INTERCEPTED by D.Hill (Z.Carter) at CIN 30."
# - The concern here is (Z.Carter)
#   - I do not know what to categorize this player as? I believe that he had an impact on the play and could possibly be a reason
#     that D.Hill was able to intercept the ball.
# ~ 2 ~
# PLAY SNIP - "(4:16) (Shotgun) J.Allen pass deep middle intended for S.Diggs INTERCEPTED by J.Whitehead [Q.Williams] at NYJ -1. Touchback."
# - The concern here is 'touchback'
#   - I have no idea what to do with that
# ~ 3 ~
# - Nothing is in place yet to handle fumbles. What happens if a QB fumbles, recovers, then throws an interception? Or if the player who intercepted the pass then fumbles?
# ~ 4 ~
# - There are 2 rows within this single play. (Intended throwing play, yardage after interception)
#   - For both of these rows that represent a single play, they both state that the throwing team has possession
#     - I do not know how this is going to affect future analysis of the data

def clean_intercepted_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_intercepted_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Interception')]
  else:
    df_intercepted_plays = df_plays[df_plays['PlayOutcome'].str.contains('Interception')]

  # Exit case (If no more 'Interception' type plays are found)
  if df_intercepted_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first interception play in 'df_intercepted_plays'
  # - Process one play per iteration in the recursive method
  idx = df_intercepted_plays.index[0]
  play = df_plays['PlayDescription'].iloc[idx]
  # play = df_plays['PlayDescription'].iloc[df_plays.index.tolist().index(idx)]

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play.find('REVERSED') != -1:
    play_elements = play.split(". ")
    for i in play_elements:
      if i.find("REVERSED") != -1:
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  # Create 2 single row dataframes.
  # 1. intended play
  df_intended_play = df_plays.iloc[idx].copy()
  df_intended_play = pd.DataFrame([df_intended_play], columns=df_plays.columns)
  df_intended_play.reset_index(drop=True, inplace=True)
  df_intended_play['PlayDescription'] = 'nan'
  # 2. yardage after interception
  df_yardage_after_interception = df_plays.iloc[idx].copy()
  df_yardage_after_interception = pd.DataFrame([df_yardage_after_interception], columns=df_plays.columns)
  df_yardage_after_interception.reset_index(drop=True, inplace=True)
  df_yardage_after_interception['PlayDescription'] = 'nan'

  # break down play by sentences.
  play_elements = play.split(". ")

  # iterate through play_elements
  for i in play_elements:

    ##############################
    # YARDAGE AFTER INTERCEPTION #
    ##############################

    yardage_after_interception = re.findall(run_after_interception, i)
    if len(yardage_after_interception) > 0:
      df_yardage_after_interception['PlayDescription'] = i

      # Player running after interception
      df_yardage_after_interception.loc[0, 'Rusher'] = yardage_after_interception[0]

      # Playtype?
      # - Should this be a new playtype? Something like "RunAfterInterception"?

      # Yardage gained
      yardage = re.findall(yardage_gained, i)
      if len(yardage) > 0:
        df_yardage_after_interception.loc[0, 'Yardage'] = int(yardage[0])
      else:
        df_yardage_after_interception.loc[0, 'Yardage'] = 0

      # Who made tackle
      tackler = re.findall(defense_tackler_1_name_pattern, i)
      if len(tackler) > 0:
        df_yardage_after_interception.loc[0, 'TackleBy1'] = tackler[0]

      continue

    ################################
    # TOUCHDOWN AFTER INTERCEPTION #
    ################################

    touchdown_after_interception_check = re.findall(touchdown_after_interception, i)
    if len(touchdown_after_interception_check) > 0:
      df_yardage_after_interception['PlayDescription'] = i

      # Player running after interception
      df_yardage_after_interception.loc[0, 'Rusher'] = touchdown_after_interception_check[0]

      # Yardage gained
      yardage = re.findall(yardage_gained, i)
      if len(yardage) > 0:
        df_yardage_after_interception.loc[0, 'Yardage'] = int(yardage[0])

      # PlayOutcome
      df_yardage_after_interception.loc[0, 'PlayOutcome'] = 'Touchdown'

      # IsScoringPlay
      df_yardage_after_interception.loc[0, 'IsScoringPlay'] = 1

      continue


    #################
    # INTENDED PLAY #
    #################

    passer_name = re.findall(passer_name_pattern, i)
    if len(passer_name) > 0:
      df_intended_play['PlayDescription'] = i

      # passer
      df_intended_play.loc[0, 'Passer'] = passer_name[0]

      # Play type
      df_intended_play.loc[0, 'PlayType'] = 'Pass'

      # TimeOnTheClock
      TimeOnTheClock = re.findall(time_on_clock_pattern, i)
      if len(TimeOnTheClock) > 0:
        df_intended_play.loc[0, 'TimeOnTheClock'] = TimeOnTheClock[0]

      # Formation
      Formation = re.findall(formation, i)
      if len(Formation) > 0:
          df_intended_play.loc[0, 'Formation'] = Formation[0]

      # Pass Type
      if i.find('deep') != -1:
        df_intended_play.loc[0, 'PassType'] = 'Deep'
      elif i.find('short') != -1:
        df_intended_play.loc[0, 'PassType'] = 'Short'

      # Pass Direction
      if i.find('left') != -1:
        df_intended_play.loc[0, 'Direction'] = 'Left'
      elif i.find('right') != -1:
        df_intended_play.loc[0, 'Direction'] = 'Right'
      elif i.find('middle') != -1:
        df_intended_play.loc[0, 'Direction'] = 'Middle'

      # Receiver
      intended_receiver_name = re.findall(intended_receiver_name_pattern, i)
      if len(intended_receiver_name) > 0:
        df_intended_play.loc[0, 'Receiver'] = intended_receiver_name[0]

      # PressureBy
      pressure_by = re.findall(defense_pressure_name_pattern, i)
      if len(pressure_by) > 0:
        df_intended_play.loc[0, 'PressureBy'] = pressure_by[0]

      # Intercepted by
      intercepted_by = re.findall(interception_name_pattern, i)
      if len(intercepted_by) > 0:
        df_intended_play.loc[0, 'InterceptedBy'] = intercepted_by[0]

      continue

    # - All other data, add to both dataframes
    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury, i)
    if len(injuries) > 0:
      df_intended_play.at[0, 'InjuredPlayers'] = injuries
      df_yardage_after_interception.at[0, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # Look at the value within penalty.
    # if there is nothing there, add a list with the penalty as an element
    # if there is something there, add to that list with the penalty as another element

    # Accepted Penalty
    if i.find('PENALTY') != -1:
      if df_intended_play['AcceptedPenalty'].iloc[0] == 'nan':
        df_intended_play.at[0, 'AcceptedPenalty'] = [i]
        df_yardage_after_interception.at[0, 'AcceptedPenalty'] = [i]
      else:
        df_intended_play.at[0, 'AcceptedPenalty'].append(i)
        df_yardage_after_interception.at[0, 'AcceptedPenalty'].append(i)

    # Declined Penalty
    if i.find('Penalty') != -1:
      if df_intended_play['DeclinedPenalty'].iloc[0] == 'nan':
        df_intended_play.at[0, 'DeclinedPenalty'] = [i]
        df_yardage_after_interception.at[0, 'DeclinedPenalty'] = [i]
      else:
        df_intended_play.at[0, 'DeclinedPenalty'].append(i)
        df_yardage_after_interception.at[0, 'DeclinedPenalty'].append(i)

  # combine both single row dataframes into one
  if df_yardage_after_interception['PlayDescription'].iloc[0] == 'nan':
    df_cleaned_replacement = df_intended_play
  else:
    df_cleaned_replacement = pd.concat([df_intended_play, df_yardage_after_interception], ignore_index=True)

  # Replace old row with new cleaned dataframe
  df_before_row = df_plays.iloc[:idx]
  df_after_row = df_plays.iloc[idx+1:]
  df_plays = pd.concat([df_before_row, df_cleaned_replacement, df_after_row], ignore_index=True)


  # If this is the last play in the dataset
  if df_intercepted_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_intercepted_plays(df_plays, idx+len(df_cleaned_replacement))
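
The field extraction for the intended-play sentence can be sketched separately. All four patterns below are hypothetical stand-ins for the notebook's own patterns (`passer_name_pattern`, `intended_receiver_name_pattern`, `interception_name_pattern`), which are defined elsewhere, and `parse_intercepted_pass` is an illustrative name. The sample sentence is the first play snip from the concerns listed above.

```python
import re

NAME = r"[A-Z][\w'-]*\.[A-Z][\w'-]+"
# Hypothetical stand-ins for the notebook's own patterns
PASSER = re.compile(rf"({NAME}) pass")
PASS_SHAPE = re.compile(r"pass (deep|short) (left|right|middle)")
RECEIVER = re.compile(rf"intended for ({NAME})")
INTERCEPTOR = re.compile(rf"INTERCEPTED by ({NAME})")

def parse_intercepted_pass(sentence):
    """Return a dict holding only the columns that could be filled from the sentence."""
    fields = {}
    for column, pattern in (("Passer", PASSER), ("Receiver", RECEIVER),
                            ("InterceptedBy", INTERCEPTOR)):
        found = pattern.search(sentence)
        if found:
            fields[column] = found.group(1)
    shape = PASS_SHAPE.search(sentence)
    if shape:
        fields["PassType"] = shape.group(1).capitalize()
        fields["Direction"] = shape.group(2).capitalize()
    return fields

sentence = ("(9:53) (Shotgun) D.Watson pass short left intended for E.Moore "
            "INTERCEPTED by D.Hill (Z.Carter) at CIN 30.")
fields = parse_intercepted_pass(sentence)
print(fields)
```

Keys that cannot be matched are simply absent, which mirrors how the cleaning function leaves a column at its default when a pattern finds nothing. The sketch deliberately ignores the player in parentheses, since his role is still an open question in the concerns above.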

### SACKS


In [132]:
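# PURPOSE:
# - Clean sacked plays
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful sack play
#                        data accessible and clean.

def clean_sacked_plays(df_plays, index_start = None):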

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_sacked_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Sack')]
  else:
    df_sacked_plays = df_plays[df_plays['PlayOutcome'].str.contains('Sack')]

  for idx, play in df_sacked_plays['PlayDescription'].items():

    ################
    # Play details #
    ################

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]


    #############
    #  OFFENSE  #
    #############

    # Formation
    Formation = re.findall(formation, play)
    if len(Formation) > 0:
      if Formation[0] == 'Aborted':
        pass
      else:
        df_plays.loc[idx, 'Formation'] = Formation[0]

    # Sacked Passer
    sacked_passer_name = re.findall(sacked_passer_name_pattern, play)
    if len(sacked_passer_name) > 0:
      df_plays.loc[idx, 'Passer'] = sacked_passer_name[0]

    # Yardage lost
    yardage = re.findall(yardage_gained, play)
    if len(yardage) == 1:
      df_plays.loc[idx, 'Yardage'] = int(yardage[0])

    #############
    #  DEFENSE  #
    #############

    # Solo sack (One person sacked the passer)
    solo_sack = re.findall(defense_tackler_1_name_pattern, play)
    if len(solo_sack) > 0:
      df_plays.loc[idx, 'SackedBy'] = solo_sack[0]

    # Split sack (A sack was given to the passer by multiple defenders)
    split_sack = re.findall(split_sack_pattern, play)
    if len(split_sack) > 0:
      df_plays.at[idx, 'SackedBy'] = list(split_sack[0])

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:
          penalties.append(i)
      df_plays.at[idx, 'AcceptedPenalty'] = penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          penalties.append(i)
      df_plays.at[idx, 'DeclinedPenalty'] = penalties

  return df_plays
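
Solo and split sacks can be told apart with two patterns. The sketch below is an assumption-laden illustration: `SPLIT_SACK`, `SOLO_SACK` and `YARDS` are hypothetical stand-ins for `split_sack_pattern`, `defense_tackler_1_name_pattern` and `yardage_gained`, and `parse_sack` is not a function from this notebook. The sample descriptions are taken from the sack plays printed in the testing area.

```python
import re

# Hypothetical stand-ins for the notebook's own patterns
SPLIT_SACK = re.compile(r"\(sack split by (.+?) and (.+?)\)")
SOLO_SACK = re.compile(r"sacked .*?yards \(([^)]+)\)")
YARDS = re.compile(r"for (-?\d+) yards")

def parse_sack(play):
    """Return (yards, sackers); sackers is always a list, even for a solo sack."""
    yards_found = YARDS.search(play)
    yards = int(yards_found.group(1)) if yards_found else None
    # Check the split form first, because the solo pattern would also match it
    split = SPLIT_SACK.search(play)
    if split:
        return yards, list(split.groups())
    solo = SOLO_SACK.search(play)
    sackers = [solo.group(1)] if solo else []
    return yards, sackers

split_play = "(15:00) (Shotgun) Z.Wilson sacked at NYJ 31 for -1 yards (sack split by L.Floyd and E.Oliver)."
solo_play = "(9:45) (Shotgun) J.Fields sacked at GB 11 for -7 yards (L.Van Ness)."
print(parse_sack(split_play))
print(parse_sack(solo_play))
```

Returning a list in both cases is one possible design choice: it keeps every `SackedBy` value the same type, whereas storing a string for solo sacks and a list for split sacks means later analysis has to handle both.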

### TOUCHDOWN PLAYS

In [135]:
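# PURPOSE:
# - Clean touchdown plays
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays with interception, passing and rushing
#                        touchdowns cleaned.

def clean_touchdown_plays(df_plays, index_start=None):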

  # Cut 'df_plays' to begin from 'index_start' to the last touchdown play available in dataframe
  if index_start is not None:
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_touchdown_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Touchdown')]
  else:
    df_touchdown_plays = df_plays[df_plays['PlayOutcome'].str.contains('Touchdown')]

  # Iterating through every touchdown play within 'df_touchdown_plays'
  for idx, play in df_touchdown_plays['PlayDescription'].items():

    # - Once I figure out what kind of touchdown it was, then I will be able to
    #   determine the 'PlayType'

    # Still need to clean intercepted play types
    if play.find("INTERCEPTED") != -1:

      # creating a copy of the intercepted touchdown play and cleaning the copy
      intercepted_touchdown_row = df_plays.iloc[idx].copy()
      intercepted_touchdown_row['PlayOutcome'] = 'Interception'
      intercepted_touchdown_row['IsScoringPlay'] = 0 # This will only be the value for the team that threw the interception
      intercepted_touchdown_row = pd.DataFrame([intercepted_touchdown_row], columns=df_plays.columns)
      intercepted_touchdown_row.reset_index(drop=True, inplace=True)
      cleaned_intercepted_touchdown_row = clean_intercepted_plays(intercepted_touchdown_row)

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:idx]
      df_after_row = df_plays.iloc[idx+1:]
      df_plays = pd.concat([df_before_row, cleaned_intercepted_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(cleaned_intercepted_touchdown_row))

    ######################
    # PASSING TOUCHDOWNS #
    ######################

    # If a play has a passer throwing the ball, I am assuming it is a passing play
    passing_play = re.findall(passer_name_pattern, play)
    if len(passing_play) > 0:

      # creating a copy of the passing touchdown play row and cleaning the copy
      passing_touchdown_row = df_plays.iloc[idx].copy()
      passing_touchdown_row['PlayType'] = 'Pass'
      passing_touchdown_row['PlayOutcome'] = 'Pass'
      passing_touchdown_row['IsScoringPlay'] = 1
      passing_touchdown_row = pd.DataFrame([passing_touchdown_row], columns=df_plays.columns)
      cleaned_passing_touchdown_row = clean_pass_plays(passing_touchdown_row)
      cleaned_passing_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].iloc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:idx]
      df_after_row = df_plays.iloc[idx+1:]
      df_plays = pd.concat([df_before_row, cleaned_passing_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+1)


    ######################
    # RUSHING TOUCHDOWNS #
    ######################

    # Rusher
    rusher_patterns = [rusher_pattern, run_after_recovery]
    # Loop through patterns and find the first match
    for pattern in rusher_patterns:
      rusher = re.findall(pattern, play)
      if len(rusher) > 0:
        # creating a copy of the rushing touchdown play row and cleaning the copy
        rushing_touchdown_row = df_plays.iloc[idx].copy()
        rushing_touchdown_row['PlayType'] = 'Run'
        rushing_touchdown_row['PlayOutcome'] = 'Run'
        rushing_touchdown_row['IsScoringPlay'] = 1
        rushing_touchdown_row = pd.DataFrame([rushing_touchdown_row], columns=df_plays.columns)
        cleaned_rushing_touchdown_row = clean_run_plays(rushing_touchdown_row)
        cleaned_rushing_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].iloc[idx]

        # Replacing old row with cleaned row
        df_before_row = df_plays.iloc[:idx]
        df_after_row = df_plays.iloc[idx+1:]
        df_plays = pd.concat([df_before_row, cleaned_rushing_touchdown_row, df_after_row], ignore_index=True)

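        # Recursion to update 'df_plays'
        if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
          return df_plays
        else:
          return clean_touchdown_plays(df_plays, idx+1)

  # Fallback: no touchdown plays were found, or the last touchdown play matched none
  # of the cases above (e.g. a punt return or fumble recovery touchdown)
  return df_plays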
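
The order of checks above (interception, then pass, then run) leaves some touchdowns unclassified. The sketch below shows that fall-through on two descriptions printed in the testing area. `PASSER` and `RUSHER` are hypothetical stand-ins for `passer_name_pattern` and `rusher_pattern`, and `touchdown_kind` is an illustrative helper only.

```python
import re

NAME = r"[A-Z][\w'-]*\.[A-Z][\w'-]+"
# Hypothetical stand-ins for 'passer_name_pattern' and 'rusher_pattern'
PASSER = re.compile(rf"({NAME}) pass ")
RUSHER = re.compile(rf"({NAME}) (?:left|right|up the middle|scrambles|kneels)")

def touchdown_kind(play):
    """Label a touchdown description using the same order of checks as above."""
    if "INTERCEPTED" in play:
        return "interception"
    if PASSER.search(play):
        return "pass"
    if RUSHER.search(play):
        return "run"
    return "other"  # e.g. punt returns, blocked kicks, fumble recoveries

punt_return = ("(9:21) S.Martin punts 42 yards to NYJ 35, Center-R.Ferguson. "
               "X.Gipson for 65 yards, TOUCHDOWN.")
fumble_return = ("(1:02) (Shotgun) S.Howell sacked at WAS 12 for -14 yards (D.Gardeck). "
                 "FUMBLES (D.Gardeck) [D.Gardeck], RECOVERED by ARI-C.Thomas at WAS 2. "
                 "C.Thomas for 2 yards, TOUCHDOWN.")
print(touchdown_kind(punt_return), touchdown_kind(fumble_return))
```

An explicit "other" label makes the uncleaned touchdowns easy to count, which could serve as a simple cleaning check.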

## 3. PIPELINE MAIN METHOD

In [130]:
# PURPOSE:
# - Accept a dataframe of plays (dataframes formatted by NFL_Scrapers) and
#   return a cleaned dataframe of those plays.
# INPUT PARAMETERS:
# df_all_plays         - dataframe - all plays in raw form from NFL_Scraper that user
#                                    would like to clean.
# OUTPUT:
# df_all_plays_cleaned - dataframe - all plays from 'df_all_plays' cleaned and data
#                                    dispersed into individual new features.

# CURRENT DESIGN PLAN:
# 1. Use uniquely designed methods for each play type to clean within dataframe
#    - (e.g. pass, run, touchdown, punt, sack, ... )
# 2. Repeat until all plays within dataframe have been cleaned.
#   NOTE:
#   - It is important to fully clean a play type before moving to the next
#      because sometimes cleaning could involve adding a new row to the dataframe,
#      causing a reset to the dataframe's indexing.
#      - If we were to separate all play types from the beginning, the indexes
#        could shift around causing, for example, an index that might originally
#        point to a run play to now instead point at a pass play.

# NOTES:
# - I think "PlayOutcomes" is what determines the yardage gained on an intended play?
#   - This does not seem right to me.
#   - EXAMPLE:
#     - (9:54) Bre.Hall left end to BUF 22 for -1 yards (G.Rousseau)
#       FUMBLES (G.Rousseau), ball out of bounds at BUF 25.
#       - I would think that Bre.Hall would get docked -1 yards for his run.
#         - But I believe that he is actually docked -4
#           - 'PlayStart' = 2nd & 9 at BUF 21
#           - The play ends at BUF 25
#             - I am going to track yardage based on possession of the ball,
#               so I will record this play as -1 yards, not -4.

def clean_dataframe_of_plays(df_all_plays):

  ###########################
  # NEW COLUMN DESCRIPTIONS #
  ###########################

  # PlayType           - The type of play (e.g. pass/run)
  # TimeOnTheClock     - The time that was on the clock when the play started
  # Formation          - Play formation
  # Passer             - Player that threw the ball (mostly the quarterback)
  # Rusher             - Player that ran the ball (mostly the running back)
  # Receiver           - Player on the same team as the passer that caught the ball
  # PassType           - Whether the pass was deep or short
  # Direction          - Where the ball is going during the play
  # Yardage            - Yards gained during the play
  # TackleBy1          - Main tackler on the play (could be solo or could be with someone else)
  # TackleBy2          - Player who assisted TackleBy1 on the tackle
  # PressureBy         - Defender that applied pressure to the passer
  # InterceptedBy      - Defender that intercepted the passing play
  # SackedBy           - Defender that sacked the passer (a list of defenders for a split sack)
  # FumbleDetails      - A list that has what happened after the fumble
  #                      - [forced fumble by, recovered by, yards gained, tackled by]
  # ReverseDetails     - A list having plays leading up to play reversal
  # InjuredPlayers     - Players that were injured during the play
  # AcceptedPenalty    - List of sentences from the play that contain 'PENALTY' (accepted penalties)
  # DeclinedPenalty    - List of sentences from the play that contain 'Penalty' (declined penalties)

  new_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "PassType", "Direction", "Yardage",
                "TackleBy1", "TackleBy2", "PressureBy", "InterceptedBy", "SackedBy",
                "FumbleDetails", "ReverseDetails",
                "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty"]

  string_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "PassType", "Direction",
                    "TackleBy1", "TackleBy2", "PressureBy", "InterceptedBy", "SackedBy",
                    "FumbleDetails", "ReverseDetails",
                    "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty"]

  int_columns = ["Yardage"]

  ########################################
  # RETURN DATAFRAME WITH ADDED FEATURES #
  ########################################

  df_all_plays_cleaned = df_all_plays.copy()
  df_all_plays_cleaned = df_all_plays_cleaned.reindex(columns=df_all_plays_cleaned.columns.tolist() + new_columns)
  df_all_plays_cleaned[string_columns] = df_all_plays_cleaned[string_columns].astype(str)
  df_all_plays_cleaned[int_columns] = df_all_plays_cleaned[int_columns].astype(float)

  ########################################
  # GETTING PLAY CATEGORIES AND CLEANING #
  ########################################

  df_all_plays_cleaned = clean_run_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_pass_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_intercepted_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_touchdown_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_sacked_plays(df_all_plays_cleaned)

  return df_all_plays_cleaned
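
The index shift described in the design notes can be reproduced on a toy dataframe. This is a minimal sketch with made-up outcome labels (the "Interception Return" value is invented for illustration and is not a value from the dataset); `splice_rows` is a hypothetical helper that mirrors the before/after `pd.concat` used by the cleaning functions.

```python
import pandas as pd

def splice_rows(df, position, replacement):
    """Replace the row at integer 'position' with the rows of 'replacement' and renumber."""
    before = df.iloc[:position]
    after = df.iloc[position + 1:]
    return pd.concat([before, replacement, after], ignore_index=True)

# Toy data: four plays, one of which is split into two rows during cleaning
plays = pd.DataFrame({"PlayOutcome": ["Run", "Interception", "Pass", "Run"]})
split = pd.DataFrame({"PlayOutcome": ["Interception", "Interception Return"]})

run_labels_before = plays.index[plays["PlayOutcome"] == "Run"].tolist()
plays = splice_rows(plays, 1, split)
run_labels_after = plays.index[plays["PlayOutcome"] == "Run"].tolist()

print(run_labels_before)  # labels of the run plays before the splice
print(run_labels_after)   # the second run play has moved down by one row
```

Any list of indexes collected before the splice would now point at the wrong rows, which is why each play type is cleaned to completion before the next one is selected.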

# TESTING AREA

In [136]:
week1_2023_plays_copy = week1_2023_plays.copy()

df_week1_plays_cleaned = clean_dataframe_of_plays(week1_2023_plays_copy)

In [None]:
df_week1_plays_cleaned.shape

(2628, 32)

In [None]:
# All touchdown plays that still need to be cleaned

df_touchdown_plays = df_week1_plays_cleaned.loc[df_week1_plays_cleaned['PlayOutcome'].str.contains('Touchdown')]

for idx, play in df_touchdown_plays['PlayDescription'].items():
  if play.find("INTERCEPTED") != -1:
    continue
  passing_td = re.findall(passer_name_pattern, play)
  if len(passing_td) > 0:
    continue
  rushing_td = re.findall(rusher_pattern, play)
  if len(rushing_td) > 0:
    continue
  time = re.findall(time_on_clock_pattern, play)
  if len(time) == 0:
    continue
  print(idx)
  print(play)
  print()

155
(9:21) S.Martin punts 42 yards to NYJ 35, Center-R.Ferguson. X.Gipson for 65 yards, TOUCHDOWN.

380
(2:41) (Shotgun) T.Lawrence sacked at JAX 28 for -8 yards (D.Buckner). FUMBLES (D.Buckner) [D.Buckner], recovered by JAX-T.Bigsby at JAX 35. T.Bigsby to JAX 35 for no gain (Z.Franklin). FUMBLES (Z.Franklin), RECOVERED by IND-D.Buckner at JAX 26. D.Buckner for 26 yards, TOUCHDOWN. The Replay Official reviewed the score ruling, and the play was Upheld. The ruling on the field stands.

837
(8:14) G.Gano 45 yard field goal is BLOCKED (J.Thomas), Center-C.Kreiter, Holder-J.Gillan, RECOVERED by DAL-N.Igbinoghene at DAL 42. N.Igbinoghene for 58 yards, TOUCHDOWN.

2502
(1:02) (Shotgun) S.Howell sacked at WAS 12 for -14 yards (D.Gardeck). FUMBLES (D.Gardeck) [D.Gardeck], RECOVERED by ARI-C.Thomas at WAS 2. C.Thomas for 2 yards, TOUCHDOWN.



In [None]:
df_sacked_plays = df_week1_plays_cleaned.loc[df_week1_plays_cleaned['PlayOutcome'].str.contains('Sack')]

for idx, play in df_sacked_plays['PlayDescription'].items():
  print(idx)
  print(play)
  print()

15
(5:10) (No Huddle, Shotgun) J.Allen sacked at NYJ 39 for -9 yards (J.Franklin-Myers).

19
(14:54) (Shotgun) J.Allen sacked at BUF 27 for -2 yards (A.Woods).

41
(:40) (Shotgun) J.Allen sacked at NYJ 23 for -3 yards (Q.Jefferson).

52
(5:00) (Shotgun) J.Allen sacked at NYJ 41 for -3 yards (Q.Jefferson).

60
(13:10) (No Huddle, Shotgun) J.Allen sacked at BUF 23 for -2 yards (J.Johnson).

83
(11:40) (Shotgun) A.Rodgers sacked at NYJ 33 for -10 yards (L.Floyd). NYJ-A.Rodgers was injured during the play. He is Out.

94
(:16) (Shotgun) Z.Wilson sacked at NYJ 32 for -12 yards (J.Phillips).

128
(15:00) (Shotgun) Z.Wilson sacked at NYJ 31 for -1 yards (sack split by L.Floyd and E.Oliver).

190
(:37) (Shotgun) J.Love sacked at CHI 34 for -8 yards (Y.Ngakoue).

265
(9:45) (Shotgun) J.Fields sacked at GB 11 for -7 yards (L.Van Ness).

280
(10:19) J.Fields sacked at CHI 14 for -11 yards (D.Wyatt).

301
(10:52) (No Huddle, Shotgun) J.Fields sacked at CHI 39 for -9 yards (K.Brooks).

327
(14:26) 

In [142]:
df_week1_plays_cleaned.iloc[104]

Unnamed: 0,104
Season,2023
Week,Week 1
Day,MON
Date,09/11
AwayTeam,Bills
HomeTeam,Jets
Quarter,2ND QUARTER
DriveNumber,5
TeamWithPossession,NYJ
IsScoringDrive,0


# CLEANED DATASET OBSERVATIONS

## Home and Away teams (Week 1, 2023)

In [None]:
# Season 2023 Week 1 schedule

df_2023_week1_schedule = df_week1_plays_cleaned[['HomeTeam', 'AwayTeam', 'Season', 'Date', 'Day']].drop_duplicates().sort_values(by='Date').reset_index(drop=True)

df_2023_week1_schedule

Unnamed: 0,HomeTeam,AwayTeam,Season,Date,Day
0,Chiefs,Lions,2023,09/07,THU
1,Bears,Packers,2023,09/10,SUN
2,Colts,Jaguars,2023,09/10,SUN
3,Browns,Bengals,2023,09/10,SUN
4,Giants,Cowboys,2023,09/10,SUN
5,Ravens,Texans,2023,09/10,SUN
6,Saints,Titans,2023,09/10,SUN
7,Broncos,Raiders,2023,09/10,SUN
8,Falcons,Panthers,2023,09/10,SUN
9,Vikings,Buccaneers,2023,09/10,SUN


## Offense Stats

Passing Example
1. Top 10 players who threw the ball the most
2. All passing plays from a specified player
3. Total passing yards from the specified player
4. All receivers targeted by the specified player
5. Plays to the specified player's top target
6. Top target receiver catching yards

In [None]:
# 1. Top 10 players who threw the ball the most

passers = df_week1_plays_cleaned['Passer'].loc[(df_week1_plays_cleaned['Season'] == 2023) &
                                                (df_week1_plays_cleaned['Week'] == 'Week 1')].value_counts().head(10)

passers

Unnamed: 0_level_0,count
Passer,Unnamed: 1_level_1
,1569
M.Jones,50
K.Pickett,45
T.Tagovailoa,45
K.Cousins,44
C.Stroud,43
J.Allen,41
M.Stafford,38
P.Mahomes,38
J.Fields,37


In [None]:
# 2. All passing plays from a specified player

passer = 'C.Stroud'

df_passing_plays_by = df_week1_plays_cleaned.loc[(df_week1_plays_cleaned['Passer'] == passer)].sort_index()

df_passing_plays_by

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,...,Direction,Yardage,TackleBy1,TackleBy2,PressureBy,InterceptedBy,FumbleDetails,ReverseDetails,InjuredPlayers,PenaltyDescription
1061,2023,Week 1,SUN,09/10,Texans,Ravens,1ST QUARTER,2,HOU,0,...,Middle,0.0,K.Hamilton,,,,,,,
1062,2023,Week 1,SUN,09/10,Texans,Ravens,1ST QUARTER,2,HOU,0,...,Right,0.0,R.Darby,,,,,,,
1064,2023,Week 1,SUN,09/10,Texans,Ravens,1ST QUARTER,4,HOU,0,...,Right,0.0,,,,,,,,
1067,2023,Week 1,SUN,09/10,Texans,Ravens,1ST QUARTER,4,HOU,0,...,Right,7.0,A.Washington,R.Smith,,,,,,
1071,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,1,HOU,0,...,Left,0.0,,,O.Oweh,,,,,
1072,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,1,HOU,0,...,Right,-1.0,R.Smith,A.Washington,,,,,,
1076,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,3,HOU,1,...,Middle,13.0,M.Williams,,,,,,,
1078,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,3,HOU,1,...,Middle,0.0,M.Williams,,,,,,,
1079,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,3,HOU,1,...,Right,5.0,R.Smith,,O.Oweh,,,,,
1080,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,3,HOU,1,...,Left,5.0,M.Harrison,M.Williams,,,,,,


In [None]:
# 3. Total passing yards from the specified player

total_passing_yards = df_passing_plays_by['Yardage'].sum()

total_passing_yards

242.0

In [None]:
# 4. All receivers targeted by the specified player (counts include incompletions)

df_all_passing_targets = df_passing_plays_by['Receiver'].loc[(df_passing_plays_by['Receiver'] != 'nan')].value_counts()

df_all_passing_targets

Unnamed: 0_level_0,count
Receiver,Unnamed: 1_level_1
N.Collins,11
R.Woods,10
D.Schultz,4
N.Dell,4
M.Boone,4
D.Pierce,3
N.Brown,3
C.Stroud,1
T.Quitoriano,1
X.Hutchinson,1


In [None]:
# 5. Top target receiver from specified player

# value_counts() sorts descending, so the first index label is the most-targeted receiver
top_target = df_all_passing_targets.index[0]
df_passers_top_target_plays = df_passing_plays_by.loc[df_passing_plays_by['Receiver'] == top_target]

df_passers_top_target_plays

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,...,Direction,Yardage,TackleBy1,TackleBy2,PressureBy,InterceptedBy,FumbleDetails,ReverseDetails,InjuredPlayers,PenaltyDescription
1071,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,1,HOU,0,...,Left,0.0,,,O.Oweh,,,,,
1078,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,3,HOU,1,...,Middle,0.0,M.Williams,,,,,,,
1083,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,3,HOU,1,...,Right,5.0,K.Hamilton,,,,,,,
1084,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,3,HOU,1,...,Middle,14.0,M.Williams,,,,,,,
1094,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,5,HOU,1,...,Right,0.0,,,,,,,,
1098,2023,Week 1,SUN,09/10,Texans,Ravens,2ND QUARTER,5,HOU,1,...,Right,0.0,,,,,,,,
1112,2023,Week 1,SUN,09/10,Texans,Ravens,3RD QUARTER,3,HOU,0,...,Middle,15.0,R.Darby,,,,,,,
1117,2023,Week 1,SUN,09/10,Texans,Ravens,3RD QUARTER,5,HOU,0,...,Middle,14.0,A.Washington,,,,,,,
1132,2023,Week 1,SUN,09/10,Texans,Ravens,4TH QUARTER,3,HOU,0,...,Left,0.0,,,,,,,,
1133,2023,Week 1,SUN,09/10,Texans,Ravens,4TH QUARTER,3,HOU,0,...,Middle,26.0,R.Darby,G.Stone,,,,,[G.Fant],


In [None]:
# 6. Top target receiver catching yards

df_passers_top_target_plays['Yardage'].sum()

80.0

Rushing Example
1. All players who carried the ball from a specified team
2. All rushing plays from top rusher of a specified team
3. Total rushing yards from top rusher of a specified team


In [None]:
# 1. All players who carried the ball from a specified team
# - TODO: map team names to their abbreviations
#   - For now the abbreviation is hard-coded ('Cowboys' == 'DAL')

team_abbreviation = 'DAL'

team_rushers = df_week1_plays_cleaned['Rusher'].loc[(df_week1_plays_cleaned['TeamWithPossession'] == team_abbreviation) &
                                                    (df_week1_plays_cleaned['Rusher'] != 'nan')].value_counts()

team_rushers

Unnamed: 0_level_0,count
Rusher,Unnamed: 1_level_1
T.Pollard,14
R.Dowdle,6
D.Vaughn,6
S.Barkley,4
D.Jones,4
K.Turpin,3
M.Breida,1
D.Bland,1
D.Prescott,1
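
The comment in the cell above notes that team names still need to be mapped to abbreviations. One simple option is a hand-written lookup table, sketched below. `TEAM_ABBREVIATIONS`, `team_to_abbreviation` and `abbreviation_to_team` are hypothetical names, and the table covers only the teams used in this notebook's examples, so it would need to be extended to all 32 teams.

```python
# Hypothetical lookup table; covers only the teams used in this notebook's examples
TEAM_ABBREVIATIONS = {
    'Bills': 'BUF',
    'Jets': 'NYJ',
    'Cowboys': 'DAL',
    'Giants': 'NYG',
    'Texans': 'HOU',
    'Ravens': 'BAL',
}

# Reverse lookup, built once from the table above
ABBREVIATION_TO_TEAM = {abbr: name for name, abbr in TEAM_ABBREVIATIONS.items()}

def team_to_abbreviation(team_name):
    """Return the abbreviation for a team name, e.g. 'Cowboys' -> 'DAL'."""
    if team_name not in TEAM_ABBREVIATIONS:
        raise KeyError(f"No abbreviation recorded for team {team_name!r}")
    return TEAM_ABBREVIATIONS[team_name]

def abbreviation_to_team(abbreviation):
    """Return the team name for an abbreviation, e.g. 'NYJ' -> 'Jets'."""
    if abbreviation not in ABBREVIATION_TO_TEAM:
        raise KeyError(f"No team recorded for abbreviation {abbreviation!r}")
    return ABBREVIATION_TO_TEAM[abbreviation]

print(team_to_abbreviation('Cowboys'))   # DAL
print(abbreviation_to_team('NYJ'))       # Jets
```

With such a table, a cell could take only `team_name` and derive `team_abbreviation` from it instead of setting both by hand.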


In [None]:
# 2. All rushing plays from top rusher of a specified team

# value_counts() sorts descending, so the first index label is the team's top rusher
top_rusher = team_rushers.index[0]
df_top_rushers_plays = df_week1_plays_cleaned.loc[df_week1_plays_cleaned['Rusher'] == top_rusher]

df_top_rushers_plays

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,...,Direction,Yardage,TackleBy1,TackleBy2,PressureBy,InterceptedBy,FumbleDetails,ReverseDetails,InjuredPlayers,PenaltyDescription
839,2023,Week 1,SUN,09/10,Cowboys,Giants,1ST QUARTER,3,DAL,1,...,right guard,4.0,M.McFadden,A.Jackson,,,,,,
842,2023,Week 1,SUN,09/10,Cowboys,Giants,1ST QUARTER,3,DAL,1,...,right guard,4.0,M.McFadden,B.Okereke,,,,,,
852,2023,Week 1,SUN,09/10,Cowboys,Giants,2ND QUARTER,1,DAL,1,...,left tackle,-2.0,J.Riley,K.Thibodeaux,,,,,,
856,2023,Week 1,SUN,09/10,Cowboys,Giants,2ND QUARTER,1,DAL,1,...,right tackle,2.0,R.Nunez-Roches,,,,,,,
868,2023,Week 1,SUN,09/10,Cowboys,Giants,2ND QUARTER,3,DAL,1,...,left guard,3.0,L.Williams,M.McFadden,,,,,,
869,2023,Week 1,SUN,09/10,Cowboys,Giants,2ND QUARTER,3,DAL,1,...,right end,2.0,,,,,,,,
873,2023,Week 1,SUN,09/10,Cowboys,Giants,2ND QUARTER,5,DAL,0,...,right guard,4.0,D.Lawrence,,,,,,,
877,2023,Week 1,SUN,09/10,Cowboys,Giants,3RD QUARTER,1,DAL,1,...,up the middle,7.0,M.McFadden,,,,,,,
880,2023,Week 1,SUN,09/10,Cowboys,Giants,3RD QUARTER,1,DAL,1,...,right tackle,25.0,T.Hawkins,,,,,,,
884,2023,Week 1,SUN,09/10,Cowboys,Giants,3RD QUARTER,1,DAL,1,...,right tackle,3.0,T.Hawkins,M.McFadden,,,,,,


In [None]:
# 3. Total rushing yards from top rusher of a specified team

df_top_rushers_plays['Yardage'].sum()

70.0

## Defense Stats

1. All defensive plays from a specified team
2. All solo tackles made by the specified team
3. All plays of the player with the most solo tackles
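
The first two steps above can be sketched as a single function. This is an illustration on toy data with made-up player names: `solo_tackle_counts` and `df_toy_defense` are hypothetical, and the column names and the `'nan'` placeholder string are assumed to match the cleaned dataset. A tackle is treated as solo when `TackleBy1` is present and `TackleBy2` is missing.

```python
import pandas as pd

def is_present(column):
    """True where a value is neither a real NaN nor the string 'nan'."""
    return column.notna() & (column != 'nan')

def solo_tackle_counts(df, team_name, team_abbreviation):
    """Count solo tackles on plays where the given team did not have possession."""
    # All plays from games the team played in
    game = df.loc[(df['HomeTeam'] == team_name) | (df['AwayTeam'] == team_name)]
    # Plays where the opponent had the ball
    defense = game.loc[game['TeamWithPossession'] != team_abbreviation]
    solo = defense.loc[is_present(defense['TackleBy1']) & ~is_present(defense['TackleBy2'])]
    return solo['TackleBy1'].value_counts()

# Toy data, only to show the filtering logic
df_toy_defense = pd.DataFrame({
    'HomeTeam':           ['Jets', 'Jets', 'Jets', 'Jets', 'Giants'],
    'AwayTeam':           ['Bills', 'Bills', 'Bills', 'Bills', 'Cowboys'],
    'TeamWithPossession': ['BUF', 'BUF', 'NYJ', 'BUF', 'DAL'],
    'TackleBy1':          ['X.One', 'X.One', 'Y.Two', 'Z.Three', 'X.One'],
    'TackleBy2':          ['nan', 'nan', 'nan', 'X.One', 'nan'],
})

counts = solo_tackle_counts(df_toy_defense, 'Jets', 'NYJ')
print(counts)
```

One caveat of filtering on `TeamWithPossession`: on plays where the ball changes hands (punts, interceptions, fumble returns), the tackler may belong to the team that started the play with possession, so some names in the result may not be defenders of the specified team.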

In [None]:
# 1. All defensive plays from a specified team

team_name = 'Jets'
team_abbreviation = 'NYJ'

df_all_game_plays = df_week1_plays_cleaned.loc[(df_week1_plays_cleaned['HomeTeam'] == team_name) |
                                                (df_week1_plays_cleaned['AwayTeam'] == team_name)]

df_all_defensive_plays = df_all_game_plays.loc[df_all_game_plays['TeamWithPossession'] != team_abbreviation]

df_all_defensive_plays

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,...,Direction,Yardage,TackleBy1,TackleBy2,PressureBy,InterceptedBy,FumbleDetails,ReverseDetails,InjuredPlayers,PenaltyDescription
0,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,...,,,,,,,,,,
1,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,...,Right,7.0,A.Gardner,,,,,,,
2,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,...,Left,5.0,Qu.Williams,,,,,,,
3,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,...,up the middle,3.0,J.Johnson,J.Franklin-Myers,,,,,,
4,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,...,up the middle,2.0,Q.Williams,J.Franklin-Myers,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,2023,Week 1,MON,09/11,Bills,Jets,4TH QUARTER,6,BUF,1,...,Left,10.0,D.Reed,,,,,,,
76,2023,Week 1,MON,09/11,Bills,Jets,4TH QUARTER,6,BUF,1,...,,0.0,,,,,,,,
77,2023,Week 1,MON,09/11,Bills,Jets,4TH QUARTER,6,BUF,1,...,Left,0.0,,,Q.Jefferson,,,,,
78,2023,Week 1,MON,09/11,Bills,Jets,4TH QUARTER,6,BUF,1,...,Left,0.0,,,,,,,,


In [None]:
# 2. All solo tackles made by the specified team

df_all_solo_tackles = df_all_defensive_plays['TackleBy1'].loc[(df_all_defensive_plays['TackleBy1'] != 'nan') &
                                                              (df_all_defensive_plays['TackleBy2'] == 'nan')].value_counts()

df_all_solo_tackles

Unnamed: 0_level_0,count
TackleBy1,Unnamed: 1_level_1
Qu.Williams,9
D.Reed,8
T.Adams,3
A.Gardner,2
C.Mosley,2
A.Amos,2
J.Sherwood,2
M.Carter,2
D.Harty,1
Q.Williams,1


In [None]:
# 3. All plays of the player with the most solo tackles

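# value_counts() sorts descending, so the first index label is the top solo tackler
top_solo_tackler = df_all_solo_tackles.index[0]
df_player_with_most_tackles = df_week1_plays_cleaned.loc[(df_week1_plays_cleaned['TackleBy1'] == top_solo_tackler) &
                                                         (df_week1_plays_cleaned['TackleBy2'] == 'nan')]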

df_player_with_most_tackles

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,...,Direction,Yardage,TackleBy1,TackleBy2,PressureBy,InterceptedBy,FumbleDetails,ReverseDetails,InjuredPlayers,PenaltyDescription
2,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,...,Left,5.0,Qu.Williams,,,,,,,
8,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,3,BUF,1,...,right tackle,5.0,Qu.Williams,,,,,,,
17,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,3,BUF,1,...,Left,13.0,Qu.Williams,,,,,,,
20,2023,Week 1,MON,09/11,Bills,Jets,2ND QUARTER,2,BUF,0,...,Left,4.0,Qu.Williams,,,,,,,
33,2023,Week 1,MON,09/11,Bills,Jets,2ND QUARTER,4,BUF,1,...,up the middle,3.0,Qu.Williams,,,,,,,
39,2023,Week 1,MON,09/11,Bills,Jets,2ND QUARTER,6,BUF,1,...,Left,3.0,Qu.Williams,,,,,,,
42,2023,Week 1,MON,09/11,Bills,Jets,2ND QUARTER,6,BUF,1,...,Right,0.0,Qu.Williams,,,,,,,
56,2023,Week 1,MON,09/11,Bills,Jets,3RD QUARTER,4,BUF,0,...,Right,4.0,Qu.Williams,,,,,,,
64,2023,Week 1,MON,09/11,Bills,Jets,4TH QUARTER,2,BUF,0,...,left guard,2.0,Qu.Williams,,,,,,,
