<a href="https://colab.research.google.com/github/KeoniM/NFL_Data_Cleaning/blob/main/NFL_Plays_Week2_2023_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PURPOSE:**
- Accurately clean a week's worth of play data
  - Season 2023 -> Week 2

**NOTE:**
- What makes version 2 different from version 1 is the data being used. The core of the data is identical to the original, but NFL.com has changed how it formats the play data that is scraped and used here, so minor adjustments are needed for the new version. This is also a good opportunity to clean up the older code: make it more readable, organized, and efficient.

# MOUNTING AND IMPORTS

In [None]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Used to access personal google cloud services
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [None]:
# Imports

# Data manipulation
import pandas as pd

# Regular expressions
import re

# Database access
from google.cloud import bigquery

# LOADING DATA (BigQuery)

In [None]:
# Client connect to bigquery project
client = bigquery.Client('nfl-data-430702')

## Season 2023 Week 2

In [None]:
# Grabbing all plays from the 2023 Week 2 NFL season
nfl_plays_week2_2023_query = """
                             SELECT *
                             FROM `nfl-data-430702.NFL_Scores_v2.NFL-Plays-Week2_2023`
                             """

# Running a dry-run query, which returns the number of bytes the real query will process
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query = client.query(nfl_plays_week2_2023_query, job_config=dry_run_config)
print("This query will process {} gigabytes.".format(dry_run_query.total_bytes_processed/10**9))

# Running query (Being mindful of the amount of data being grabbed)
# Will grab a maximum of a Gigabyte
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
safe_config_query = client.query(nfl_plays_week2_2023_query, job_config=safe_config)

This query will process 0.000645655 gigabytes.


In [None]:
# Putting data attained from query into a dataframe
week2_2023_plays = safe_config_query.to_dataframe()

In [None]:
week2_2023_plays.head()

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,PlayNumberInDrive,IsScoringPlay,PlayOutcome,PlayStart,PlayTimeFormation,PlayDescription
0,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,1,0,Kickoff from MIN 35,,Kickoff,— G.Joseph kicks 65 yards from MIN 35 to end z...
1,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,2,0,6 Yard Pass,1st & 10 at PHI 25,15:00 1st Shotgun,— J.Hurts pass short right to D.Smith to PHI 3...
2,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,3,0,-1 Yard Sack,2nd & 4 at PHI 31,14:27 1st Shotgun,— J.Hurts sacked at PHI 30 for -1 yards (sack ...
3,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,4,0,7 Yard Run,3rd & 5 at PHI 30,13:45 1st Shotgun,— J.Hurts scrambles right end pushed ob at PHI...
4,2023,Week 2,THU,09/14,Vikings,Eagles,1st Quarter,1,Philadelphia Eagles,1,5,0,-1 Yard Pass,1st & 10 at PHI 37,13:10 1st Shotgun,— J.Hurts pass short left to D.Goedert to PHI ...


# CATEGORIZE PLAYS
- The goal here is to parse out the different values for 'PlayOutcome'
  - Here is where I will separate different types of plays
    - ( pass / run / kickoff / etc..)

In [None]:
# All play outcomes from the game
# - From here we can categorize and clean plays accordingly
week2_2023_plays['PlayOutcome'].unique()

array(['Kickoff from MIN 35', '6 Yard Pass', '-1 Yard Sack', '7 Yard Run',
       '-1 Yard Pass', '54 Yard Pass', '1 Yard Pass', '1 Yard Run',
       '2 Yard Run', 'Field Goal', 'Kickoff from PHI 35', '15 Yard Pass',
       'Pass Incomplete', 'Punt', '-5 Yard Penalty', '3 Yard Run',
       'Fumble', '4 Yard Run', '12 Yard Run', '-7 Yard Sack',
       'Interception', '0 Yard Run', '5 Yard Pass', '-3 Yard Run',
       'Field Goal No Good', '5 Yard Penalty', '9 Yard Pass',
       '-2 Yard Run', '3 Yard Pass', '24 Yard Pass', '5 Yard Run',
       '7 Yard Pass', 'Touchdown', 'Extra Point', '6 Yard Run',
       '8 Yard Run', '11 Yard Pass', 'Timeout', '13 Yard Pass',
       '4 Yard Pass', '18 Yard Pass', '18 Yard Run', '0 Yard Pass',
       '-5 Yard Pass', '-10 Yard Penalty', '2 Yard Pass', '9 Yard Run',
       '11 Yard Run', '8 Yard Pass', '-2 Yard Sack', '-12 Yard Sack',
       '23 Yard Pass', '22 Yard Pass', '14 Yard Pass', '12 Yard Pass',
       '43 Yard Run', '10 Yard Pass', '16 Yard Pa

In [None]:
# NOTES:
# - Currently, I am eyeballing all unique play outcomes to categorize them.
#   - This approach is not flexible because a play outcome can
#     arise that has not been seen yet.
#     - There may be more play outcomes in the future when working on a full season,
#       let alone all seasons and future games

# Play Types with complete cleaning methods (As far as this sample size goes)

# ~ OFFENSE ~
df_2023_pass_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Pass')]
df_2023_run_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Run')]
# ~ DEFENSE ~
df_2023_interception_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Interception')]
df_2023_sack_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Sack')]
# ~ SPECIAL TEAMS ~
df_2023_punt_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Punt')]
df_2023_kickoff_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Kickoff')]
# ~ SCORING ~
df_2023_touchdown_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Touchdown')]
df_2023_extrapoint_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Extra Point')]
df_2023_fieldgoal_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Field Goal')]
# df_2023_2pt_conversion_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('2PT Conversion')]
df_2023_2pt_conversion_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Conversion')]
# ~ OTHER ~
df_2023_fumble_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Fumble')]
df_2023_penalty_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Penalty')]
df_2023_turnover_on_downs_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Turnover on Downs')]
df_2023_timeout_week2 = week2_2023_plays[week2_2023_plays['PlayOutcome'].str.contains('Timeout')]
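
The note above points out that hand-picked filters will miss outcomes that have not been seen yet. One possible way to surface them, sketched here under assumptions: label every outcome with the first keyword it contains and fall back to an explicit `'Uncategorized'` label. The names `CATEGORY_KEYWORDS` and `categorize_outcome`, and the sample outcome `'Safety'`, are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical keyword -> label table; order matters because the first hit wins.
CATEGORY_KEYWORDS = {
    'Pass': 'Pass', 'Run': 'Run',
    'Interception': 'Interception', 'Sack': 'Sack',
    'Punt': 'Punt', 'Kickoff': 'Kickoff',
    'Touchdown': 'Touchdown', 'Extra Point': 'Extra Point',
    'Field Goal': 'Field Goal', 'Conversion': '2PT Conversion',
    'Fumble': 'Fumble', 'Penalty': 'Penalty',
    'Turnover on Downs': 'Turnover on Downs', 'Timeout': 'Timeout',
}

def categorize_outcome(outcome):
    """Return the label of the first keyword found in the outcome text."""
    for keyword, label in CATEGORY_KEYWORDS.items():
        if keyword in outcome:
            return label
    # Anything not covered by the table is flagged instead of silently dropped.
    return 'Uncategorized'

# Small made-up sample; 'Safety' stands in for an outcome not seen so far.
outcomes = pd.Series(['6 Yard Pass', '-1 Yard Sack', 'Field Goal No Good', 'Safety'])
categories = outcomes.map(categorize_outcome)
uncategorized = outcomes[categories == 'Uncategorized']
```

Applied to `week2_2023_plays['PlayOutcome']`, a non-empty `uncategorized` series would show which new outcomes need a cleaning method.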

## SANITY CHECK (All Plays Accounted for)
  - Once all plays have been categorized, will compare the sum of all plays in each category to the size of the original dataframe of plays.
    - Goal is to make sure the number of plays is the same.

In [None]:
# Categorized plays

plays_list = [df_2023_pass_week2,         # Offense
              df_2023_run_week2,
              df_2023_interception_week2, # Defense
              df_2023_sack_week2,
              df_2023_punt_week2,         # Special Teams
              df_2023_kickoff_week2,
              df_2023_touchdown_week2,    # Scoring
              df_2023_extrapoint_week2,
              df_2023_fieldgoal_week2,
              df_2023_2pt_conversion_week2,
              df_2023_fumble_week2,       # Other
              df_2023_penalty_week2,
              df_2023_turnover_on_downs_week2,
              df_2023_timeout_week2]

num_plays_categorized = 0

for plays in plays_list:
  num_plays_categorized = num_plays_categorized + len(plays)

num_plays_categorized == len(week2_2023_plays)

True
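
Comparing totals can hide a problem: a play matched by two filters and a play matched by none cancel each other out in the sum. A stricter check, sketched here as an assumption rather than the notebook's method, counts how many keywords each individual play matches and expects exactly one. The names `CATEGORY_KEYWORDS` and `category_match_counts` are hypothetical.

```python
import pandas as pd

# Same keywords as the filters above (hypothetical constant name).
CATEGORY_KEYWORDS = ['Pass', 'Run', 'Interception', 'Sack', 'Punt', 'Kickoff',
                     'Touchdown', 'Extra Point', 'Field Goal', 'Conversion',
                     'Fumble', 'Penalty', 'Turnover on Downs', 'Timeout']

def category_match_counts(play_outcomes):
    """Count, per play, how many category keywords its outcome contains."""
    matches = pd.DataFrame({
        keyword: play_outcomes.str.contains(keyword, regex=False)
        for keyword in CATEGORY_KEYWORDS
    })
    return matches.sum(axis=1)

# Made-up sample; 'Safety' represents an outcome no filter covers.
sample_outcomes = pd.Series(['6 Yard Pass', 'Punt', 'Pass Incomplete', 'Safety'])
counts = category_match_counts(sample_outcomes)
missed = sample_outcomes[counts == 0]
double_counted = sample_outcomes[counts > 1]
```

If both `missed` and `double_counted` are empty for the real data, every play belongs to exactly one category.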

# PIPELINE
- ORDER
  1. Team Dictionary
    - Used to map team names with their acronyms
  2. Regular expressions
    - Used to find common patterns within raw data
  3. Transforming Data
    - So far, only label encoding
  4. Cleaning methods
    - Unique cleaning methods for each play type
  5. Main pipeline method
    - Control flow of cleaning methods

## 1. TEAM DICTIONARY

In [None]:
# KEY: Team name
# VALUE: Acronym of team

dict_teams = {
    'Cardinals': 'ARI', 'Falcons': 'ATL', 'Ravens': 'BAL', 'Bills': 'BUF', 'Panthers': 'CAR', 'Bears': 'CHI',
    'Bengals': 'CIN', 'Browns': 'CLE', 'Cowboys': 'DAL', 'Broncos': 'DEN', 'Lions': 'DET', 'Packers': 'GB',
    'Texans': 'HOU', 'Colts': 'IND', 'Jaguars': 'JAX', 'Chiefs': 'KC', 'Raiders': 'LV', 'Chargers': 'LAC',
    'Rams': 'LAR', 'Dolphins': 'MIA', 'Vikings': 'MIN', 'Patriots': 'NE', 'Saints': 'NO', 'Giants': 'NYG',
    'Jets': 'NYJ', 'Eagles': 'PHI', 'Steelers': 'PIT', '49ers': 'SF', 'Seahawks': 'SEA', 'Buccaneers': 'TB',
    'Titans': 'TEN', 'Commanders': 'WAS'
}

In [None]:
# KEY: Full Team name
# VALUE: Acronym of team

dict_teams_2 = {
    'Arizona Cardinals': 'ARI', 'Atlanta Falcons': 'ATL', 'Baltimore Ravens': 'BAL', 'Buffalo Bills': 'BUF', 'Carolina Panthers': 'CAR', 'Chicago Bears': 'CHI',
    'Cincinnati Bengals': 'CIN', 'Cleveland Browns': 'CLE', 'Dallas Cowboys': 'DAL', 'Denver Broncos': 'DEN', 'Detroit Lions': 'DET', 'Green Bay Packers': 'GB',
    'Houston Texans': 'HOU', 'Indianapolis Colts': 'IND', 'Jacksonville Jaguars': 'JAX', 'Kansas City Chiefs': 'KC', 'Las Vegas Raiders': 'LV', 'Los Angeles Chargers': 'LAC',
    'Los Angeles Rams': 'LAR', 'Miami Dolphins': 'MIA', 'Minnesota Vikings': 'MIN', 'New England Patriots': 'NE', 'New Orleans Saints': 'NO', 'New York Giants': 'NYG',
    'New York Jets': 'NYJ', 'Philadelphia Eagles': 'PHI', 'Pittsburgh Steelers': 'PIT', 'San Francisco 49ers': 'SF', 'Seattle Seahawks': 'SEA', 'Tampa Bay Buccaneers': 'TB',
    'Tennessee Titans': 'TEN', 'Washington Commanders': 'WAS'
}

In [None]:
# KEY: Acronym of team
# VALUE: Team name

dict_teams_3 = {
    'ARI': 'Arizona Cardinals', 'ATL': 'Atlanta Falcons', 'BAL': 'Baltimore Ravens', 'BUF': 'Buffalo Bills', 'CAR': 'Carolina Panthers', 'CHI': 'Chicago Bears',
    'CIN': 'Cincinnati Bengals', 'CLE': 'Cleveland Browns', 'DAL': 'Dallas Cowboys', 'DEN': 'Denver Broncos', 'DET': 'Detroit Lions', 'GB': 'Green Bay Packers',
    'HOU': 'Houston Texans', 'IND': 'Indianapolis Colts', 'JAX': 'Jacksonville Jaguars', 'KC': 'Kansas City Chiefs', 'LV': 'Las Vegas Raiders', 'LAC': 'Los Angeles Chargers',
    'LAR': 'Los Angeles Rams', 'MIA': 'Miami Dolphins', 'MIN': 'Minnesota Vikings', 'NE': 'New England Patriots', 'NO': 'New Orleans Saints', 'NYG': 'New York Giants',
    'NYJ': 'New York Jets', 'PHI': 'Philadelphia Eagles', 'PIT': 'Pittsburgh Steelers', 'SF': 'San Francisco 49ers', 'SEA': 'Seattle Seahawks', 'TB': 'Tampa Bay Buccaneers',
    'TEN': 'Tennessee Titans', 'WAS': 'Washington Commanders'
}
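
The three dictionaries carry the same information in different directions, so they could be derived from a single source to keep them in sync. A minimal sketch, assuming every nickname is the last word of the full team name (true for the 32 teams listed above); the `_sample` and `_derived` names are illustrative only.

```python
# A two-team excerpt of the full-name dictionary, used only for illustration.
dict_teams_2_sample = {
    'Philadelphia Eagles': 'PHI',
    'San Francisco 49ers': 'SF',
}

# Acronym -> full name: invert the mapping.
dict_teams_3_derived = {abbr: name for name, abbr in dict_teams_2_sample.items()}

# Nickname -> acronym: assumes the nickname is the final word of the full name.
dict_teams_derived = {name.split()[-1]: abbr for name, abbr in dict_teams_2_sample.items()}
```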

## 2. REGULAR EXPRESSIONS

In [104]:
####################################################
# REGULAR EXPRESSIONS USED TO LOCATE SPECIFIC DATA #
####################################################

###########
# GENERAL #
###########

# Player's name (Grabs every variation encountered so far)
# - I need this to be able to grab 'A.St. Brown' & 'C.Edwards-Helaire' & 'L.Van Ness'
# - I can imagine that I will have to change this again in the future.
#   - Specifically the 'compound surnames' part

#                                V  V <-> meant to grab initial of first name and compound surnames such as "St." in "A.St. Brown"
#                V   1 name abr   V  V last name V     VV <-> name separator ( - | . )                           V      V <-> last name 2 (such as "Ness" in "L.Van Ness")
#                                                          V          common words that follow name           V
name_pattern = r"(?:[A-Za-z]{1,4}\.)+(?:[A-Za-z]+)?(?:[- ](?!to|pushed|INTERCEPTED|scrambles|for|pass|ran|is|at)[A-Za-z]+)?"

spotting_pattern = "(?:([A-Z]+) )?([0-9]+)"

# Injuries (Returns the player(s) who got injured during the play)
injury_pattern = f"[A-Z]+-({name_pattern}) was injured during the play"

################
# PLAY DETAILS #
################

# Positioning at the end of the play
standard_play_end_pattern = "(?:to|at) (?:([A-Z]+) )?([0-9]+) for (no gain|-?[0-9]+)(?: yards?)?"

###########
# OFFENSE #
###########

# Passer (Player passing, Player spiking, Player who got sacked)
passer_name_pattern = f"({name_pattern}) (?:pass|spiked|sacked)"

# Pass play (Returns intended receiver and the direction of the pass)
receiver_pattern = f"(short|deep) (left|right|middle) (?:to|intended for) ({name_pattern})"

# Rushing play (Player running ball)
rusher_pattern = f"({name_pattern})(?: scrambles)? (?:(left|right|up|kneels)) (?:(the middle|guard|tackle|end))?"

###########
# DEFENSE #
###########

# Tackles

# solo / sack
solo_tackle_pattern = rf"\(({name_pattern})\)"

# shared
shared_tackle_pattern = rf"\(({name_pattern}), ({name_pattern})\)"

# assisted
assisted_tackle_pattern = rf"\(({name_pattern}); ({name_pattern})\)"

# Pressure (Who applied pressure to passer)
# - I think it might be possible for multiple defenders to apply pressure to the passer.
defense_pressure_name_pattern = rf"\[({name_pattern})\]"
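
Because the name pattern is expected to keep changing, it may help to pin its current behavior with a few quick checks. The sketch below copies four of the patterns from this cell and runs them against the three names mentioned in the comments and against one play description. The description string is made up in the style of the scraped data, not taken from it.

```python
import re

# Copied from the cell above so this block runs on its own.
name_pattern = r"(?:[A-Za-z]{1,4}\.)+(?:[A-Za-z]+)?(?:[- ](?!to|pushed|INTERCEPTED|scrambles|for|pass|ran|is|at)[A-Za-z]+)?"
passer_name_pattern = f"({name_pattern}) (?:pass|spiked|sacked)"
receiver_pattern = f"(short|deep) (left|right|middle) (?:to|intended for) ({name_pattern})"
solo_tackle_pattern = rf"\(({name_pattern})\)"

# Names the comments call out as hard cases.
tricky_names = ['A.St. Brown', 'C.Edwards-Helaire', 'L.Van Ness']
names_ok = all(re.fullmatch(name_pattern, name) for name in tricky_names)

# Hypothetical description written to resemble the NFL.com format.
description = "J.Hurts pass short right to D.Smith to PHI 31 for 6 yards (H.Smith)."
passer = re.findall(passer_name_pattern, description)
receiver = re.findall(receiver_pattern, description)
tackler = re.findall(solo_tackle_pattern, description)
```

Adding a new name to `tricky_names` whenever the pattern is adjusted gives a cheap regression check.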


## 3. TRANSFORMING DATA

In [None]:
# PURPOSE:
# - Take value for 'PlayTimeFormation' and split into 3 separate features.
#   1. GameClock (Will come about when renaming 'PlayTimeFormation')
#   2. Quarter (This feature already exists, but the values within 'PlayTimeFormation' are more accurate and will replace the original values)
#   3. Formation

def playtimeformation_split(df_plays):

  df_plays_copy = df_plays.copy()

  new_columns = ['Formation']

  df_plays_copy = df_plays_copy.rename(columns = {'PlayTimeFormation': 'GameClock'})

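  df_plays_copy = df_plays_copy.reindex(columns=df_plays_copy.columns.tolist() + new_columns)
  # Keep 'Formation' as an object column so string values can be assigned below
  df_plays_copy['Formation'] = df_plays_copy['Formation'].astype(object)

  # Splitting original feature 'PlayTimeFormation' (now known as 'GameClock')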
  for idx, play in df_plays_copy['GameClock'].items():
    value_elements = play.split(' ')
    # Some plays (e.g. Kickoff) will only have the formation as a value
    if len(value_elements) <= 1:
      df_plays_copy.at[idx, 'Formation'] = value_elements[0]
      df_plays_copy.at[idx, 'GameClock'] = ""
    else:
      df_plays_copy.at[idx, 'GameClock'] = value_elements[0]
      df_plays_copy.at[idx, 'Quarter'] = value_elements[1]
      df_plays_copy.at[idx, 'Formation'] = " ".join(value_elements[2::])

  # Transform values in 'Quarter' feature from string to integer (e.g. '1st Quarter' -> 1)
  dict_replace_quarter = {'1st Quarter': 1, '2nd Quarter': 2, '3rd Quarter': 3, '4th Quarter': 4,
                          '1st': 1, '2nd': 2, '3rd': 3, '4th': 4}

  # All overtime quarters will have the value 5 in their place
  df_plays_copy['Quarter'] = df_plays_copy['Quarter'].map(dict_replace_quarter).fillna(5).astype(int)

  return df_plays_copy

# PURPOSE:
# - Take value for 'PlayStart' and split into 2 separate features.
#   1. DownAndDistance (Will come about when renaming 'PlayStart')
#   2. FieldPosition (Start of play)

def playstart_split(df_plays):

  df_plays_copy = df_plays.copy()

  new_columns = ['FieldPosition']

  df_plays_copy = df_plays_copy.rename(columns = {'PlayStart': 'DownAndDistance'})

  df_plays_copy = df_plays_copy.reindex(columns=df_plays_copy.columns.tolist() + new_columns)

  df_plays_copy['FieldPosition'] = df_plays_copy['FieldPosition'].astype(str)

  # Splitting original feature 'PlayStart' (Now known as 'DownAndDistance')
  for idx, play in df_plays_copy['DownAndDistance'].items():
    # Some plays do not have a down and distance or field position and contain 'nan' values here,
    # this is to catch those plays and keep going. (e.g. Kickoff / Extra Point / etc..)
    if pd.isna(play):
      continue
    else:
      value_elements = play.split(' at ')
      df_plays_copy.at[idx, 'DownAndDistance'] = value_elements[0]
      df_plays_copy.at[idx, 'FieldPosition'] = value_elements[1]

  return df_plays_copy

# PURPOSE:
# - Keep team names consistent
#   - A team name will always be represented by their acronym

def consistent_team_names(df_plays):

  df_plays_copy = df_plays.copy()

  df_plays_copy['AwayTeam'] = df_plays_copy['AwayTeam'].map(dict_teams)
  df_plays_copy['HomeTeam'] = df_plays_copy['HomeTeam'].map(dict_teams)
  df_plays_copy['TeamWithPossession'] = df_plays_copy['TeamWithPossession'].map(dict_teams_2)

  return df_plays_copy
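
The row-by-row loop in `playtimeformation_split` could also be expressed with pandas string methods. The sketch below is one possible vectorized variant, not the notebook's implementation. It assumes the game clock always looks like `MM:SS` and that the quarter token contains no spaces. The function name `split_play_time_formation` and the `'OT'` sample values are hypothetical.

```python
import pandas as pd

QUARTER_TO_INT = {'1st Quarter': 1, '2nd Quarter': 2, '3rd Quarter': 3, '4th Quarter': 4,
                  '1st': 1, '2nd': 2, '3rd': 3, '4th': 4}

def split_play_time_formation(df_plays):
    """Split 'PlayTimeFormation' into GameClock, Quarter and Formation."""
    out = df_plays.copy()
    # Optional "<clock> <quarter> " prefix, then whatever is left is the formation.
    parts = out['PlayTimeFormation'].str.extract(r'^(?:(\d{1,2}:\d{2}) (\S+) ?)?(.*)$')
    out['GameClock'] = parts[0].fillna('')
    out['Formation'] = parts[2]
    # Prefer the quarter found in the text, else keep the existing column value.
    quarter_text = parts[1].fillna(out['Quarter'])
    # Anything not in the mapping (overtime) becomes 5, as in the loop version.
    out['Quarter'] = quarter_text.map(QUARTER_TO_INT).fillna(5).astype(int)
    return out.drop(columns='PlayTimeFormation')

# Made-up rows in the style of the scraped data; 'OT' is an assumed overtime label.
sample = pd.DataFrame({
    'Quarter': ['1st Quarter', '1st Quarter', 'OT'],
    'PlayTimeFormation': ['Kickoff', '15:00 1st Shotgun', '0:45 OT Shotgun'],
})
result = split_play_time_formation(sample)
```

Unlike the loop version, this sketch appends the new columns at the end instead of renaming `PlayTimeFormation` in place, so column order differs.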

## 4. CLEANING METHODS

### HELPER CLEANING METHODS

#### YARDAGE BETWEEN SPOTTINGS

In [None]:
# PURPOSE:
# - Calculate the yardage between two spottings


# MOST BENEFICIAL WHERE:
# 1. fumbled plays
# 2. penalty plays

# CONCERNS
# 1. Should I only use this method for plays that absolutely need it?
#    - This seems like it would be a lengthy process having to go through
#      this method for each play.

# WHAT I NEED
# 1. start spotting
# 2. end spotting
# 3. direction to goal

# FEATURES THAT COULD HELP:

# - STRICTLY FOR DIRECTION
#   1. dataframe of plays (NOT IMPLEMENTED IN FIRST ITERATION)
#   2. play index (NOT IMPLEMENTED IN FIRST ITERATION)
#      - might need to reference other plays in the drive or quarter
#      DESIGN NOTE:
#      - The index does not have to be the original from the dataframe of plays,
#        the dataframe of plays does have to be original. I just need to be able
#        to grab features from this play being looked at to reference other plays
#        within the dataframe of plays.

# - BREAD AND BUTTER (most will only need these)
#   3. description of action within play (could be a slice of a single play)
#      - This is where I will find the 'end spotting'.
#      - Some plays will have multiple actions with different yardage gains in them.
#        I need to pinpoint which action I am looking at specifically
#   4. start spotting
#      - Because of the multiple actions nature of some of these plays
#        (fumbles / penalties) I will need to locate the start spotting before
#        hand.
#      DESIGN NOTE:
#      - I may have to cycle regular expressions to find the correct end spotting

# DESIGN MENTALITY:
# - Iterate over time.

def yardage_between_spottings(df_plays, play_index, start_spotting, description_with_end_spotting):

  # DIRECTION
  # - I need to figure out which zone is past the 50 and which zone is within the 50 for the team
  #   with the ball. (e.g. 'BUF' is 100-51, 'KC' is 49-0, 50 is neutral)
  #   - I will find this by looking at
  #     1. start spotting
  #     2. end spotting
  #     3. yardage gained between
  #        - Majority of play descriptions will have the 'end spotting' and
  #          'yardage gained between'. These are essential and if they are not
  #          located within the passed in 'description_with_end_spotting' then
  #          that is when I will need to look at another play within this quarter.

  # DESIGN
  # - Every spotting will have both the zone and the yardage (e.g. 'BUF 20')
  #   - I want all spottings to be on a 100 point scale to represent the length of
  #     the field, the zone will aid in this.
  #     - EXAMPLE:
  #       - 'BUF 20'
  #       - (BUF zone is 100-51)
  #         - 100 - 20 = 80 yards to endzone
  #       - (BUF zone is 49-0)
  #         - 20 yard to endzone
  #   - The reason for doing this is so that I will be able to tell, given 2 spottings,
  #     whether it was a negative gain vs positive.
  #     - EXAMPLE:
  #       - start_spotting = BUF 20
  #       - end_spotting   = BUF 30
  #       - (BUF zone is 100-51)
  #         - start_spotting = 100 - 20 = 80 yards until endzone
  #         - end_spotting   = 100 - 30 = 70 yards until endzone
  #           - yardage gained = 80 - 70 = 10 yards gained
  #       - start_spotting = BUF 20
  #       - end_spotting   = BUF 30
  #       - (BUF zone is 49-0)
  #         - start_spotting = 20 yards until endzone
  #         - end_spotting   = 30 yards until endzone
  #           - yardage gained = 20 - 30 = -10 yards gained

  # LOCATE
  # start_territory
  # start_yardage
  # end_territory
  # end_yardage
  # pseudo_play_yardage

  # Splitting start_spotting (e.g. [('BUF', '20')])
  start_elements = re.findall(spotting_pattern, start_spotting)
  start_territory = start_elements[0][0]
  start_yardage = int(start_elements[0][1])

  # Grabbing and splitting end_spotting with the reported play yardage. (e.g. [('BUF', '30', '10')])
  end_spotting_and_play_yardage = re.findall(standard_play_end_pattern, description_with_end_spotting)
  if end_spotting_and_play_yardage:
    end_territory = end_spotting_and_play_yardage[0][0]
    end_yardage = int(end_spotting_and_play_yardage[0][1])
  else:
    # Return 0 because no end spotting was found (e.g. pass incomplete)
    return 0

  # Grabbing play_yardage
  # - The yardage here is not always the accurate yardage gained on the play. That
  #   is why we have this method.
  if end_spotting_and_play_yardage[0][2] == 'no gain':
    # Return 0 because the start and end spottings are the same
    return 0
  else:
    pseudo_play_yardage = int(end_spotting_and_play_yardage[0][2])

  # PLAN ON HOW TO LOCATE ZONES
  # 1. spotting_difference
  #    - ( start_spotting - end_spotting )
  # 2. pseudo_play_yardage
  #    - Was the yardage recorded in the play
  #      description positive or negative?

  # SCHEMATIC? BLUEPRINT? I cant figure out the right word.
  # Standard cases (start position and end position are in the same zone):
  # spotting_difference (+) & pseudo_play_yardage (+):
  # - the start position team zone (49-0)
  # spotting_difference (-) & pseudo_play_yardage (+):
  # - the start position team zone (100-51)
  # spotting_difference (+) & pseudo_play_yardage (-):
  # - the start position team zone (100-51)
  # spotting_difference (-) & pseudo_play_yardage (-):
  # - the start position team zone (49-0)

  # Unique case (start position and ending position are in different zones):
  # zones switch (e.g. KC 47 -> BUF 47)
  # pseudo_play_yardage (+):
  # - the start position team zone (100-51)
  # pseudo_play_yardage (-):
  # - the start position team zone (49-0)

  # Standard cases
  if (start_territory == end_territory):
    # spotting_difference (+)
    if start_yardage > end_yardage:
      # pseudo_play_yardage (+)
      # starting position 49-0 zone
      if pseudo_play_yardage > 0:
        starting_position = start_yardage
        ending_position = end_yardage
      # pseudo_play_yardage (-)
      # starting position 100-51
      else:
        starting_position = 100 - start_yardage
        ending_position = 100 - end_yardage
    # spotting_difference (-)
    else:
      # pseudo_play_yardage (+)
      # starting position 100-51
      if pseudo_play_yardage > 0:
        starting_position = 100 - start_yardage
        ending_position = 100 - end_yardage
      # pseudo_play_yardage (-)
      # starting position 49-0
      else:
        starting_position = start_yardage
        ending_position = end_yardage
  else:
    # pseudo_play_yardage (+)
    # starting position 100-51
    if pseudo_play_yardage > 0:
      starting_position = 100 - start_yardage
      ending_position = end_yardage
    # pseudo_play_yardage (-)
    # starting position 49-0
    else:
      starting_position = start_yardage
      ending_position = 100 - end_yardage


  # DESIGN CHECK. (Checking for accuracy)
  if pseudo_play_yardage != int(starting_position) - int(ending_position):
    raise ValueError(f"Yardage mismatch at play_index {play_index}, \"{description_with_end_spotting}\"")



  return int(starting_position) - int(ending_position)
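
The method above infers field direction from the sign of the reported yardage. An alternative, sketched here under assumptions and not used anywhere in this notebook, is to take the direction from the team with possession: a spotting in the offense's own territory is `100 - yard line` yards from the goal, and a spotting in the opponent's territory is the yard line itself. This assumes the territory prefix in a spotting matches the acronyms in the team dictionaries. The function names `yards_to_goal` and `yardage_gained` are hypothetical.

```python
import re

# Same shape as spotting_pattern: optional territory, then a yard line.
SPOTTING_RE = re.compile(r"(?:([A-Z]+) )?([0-9]+)")

def yards_to_goal(spotting, offense_abbr):
    """Convert a spotting such as 'BUF 20' to yards remaining for the offense."""
    match = SPOTTING_RE.fullmatch(spotting.strip())
    if match is None:
        raise ValueError(f"Unrecognized spotting: {spotting!r}")
    territory, yard_line = match.group(1), int(match.group(2))
    if territory is None:
        # No territory prefix: treated as midfield ('50').
        return yard_line
    if territory == offense_abbr:
        # Own territory: the goal line is on the far side of the 50.
        return 100 - yard_line
    # Opponent territory: the yard line is the distance to the goal.
    return yard_line

def yardage_gained(start_spotting, end_spotting, offense_abbr):
    """Positive when the offense moved closer to the opponent's goal line."""
    return yards_to_goal(start_spotting, offense_abbr) - yards_to_goal(end_spotting, offense_abbr)

gain_own_side = yardage_gained('BUF 20', 'BUF 30', 'BUF')
loss_opp_side = yardage_gained('BUF 20', 'BUF 30', 'KC')
gain_across_midfield = yardage_gained('KC 47', 'BUF 47', 'KC')
```

The first two calls reproduce the worked 'BUF 20' to 'BUF 30' examples from the design comments, once with BUF on offense and once with BUF on defense.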

### OFFENSIVE CLEANING METHODS

#### PASS PLAYS

In [105]:
# PURPOSE:
# - Clean all passing play types
# INPUT PARAMETERS:
# df_plays    - dataframe - NFL plays
# index_start -  integer  - index in the dataframe of NFL plays where the method
#                           will start cleaning in ascending order.
# RETURN:
# df_plays - dataframe - the same input df_plays but with all passing play types cleaned

def clean_pass_plays(df_plays, index_start = None):

  # Adjusting df_plays to start cleaning at a specified index (index_start)
  if index_start is not None:
    # Locating all passing type plays (starting from 'index_start')
    df_plays_adjusted = df_plays.loc[index_start:]
    df_pass_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Pass')]
  else:
    # Locating all passing type plays (From entire input dataframe)
    df_pass_plays = df_plays[df_plays['PlayOutcome'].str.contains('Pass')]

  for idx, play in df_pass_plays['PlayDescription'].items():

    # print(idx)
    # print(play)

    ################
    # PLAY DETAILS #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Pass'

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before the "reversed" sentence is stored within "ReverseDetails"
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('REVERSED') != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############
    # LATERALS #
    ############
    # - Yardage gained from a lateral.. what would this look like?
    #   - Would the lateral method completely clean that play?
    #     - I think so.

    ###########
    # FUMBLES #
    ###########
    # - Yardage gained from a fumble.. what would this look like?
    #   - Would the fumble method completely clean that play?
    #     - I think so.

    ###########
    # OFFENSE #
    ###########

    # These may have to change in the future
    # - I do not think that the value with the 'end_spotting' will always
    #   be 'play'. I think that in the future, I will need to get more precise
    #   with this.
    # - I do not think that 'start_spotting' will always be the field position.
    start_spotting = df_plays.loc[idx, 'FieldPosition']
    description_with_end_spotting = play
    df_plays.loc[idx, 'Yardage'] = yardage_between_spottings(df_plays, idx, start_spotting, description_with_end_spotting)


    # I am not giving up on this option of receiving play yardage
    # VVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVVV
    # - I think it is the fastest and is accurate for normal plays.

    # action_yardage = re.findall(standard_play_end_pattern, play)
    # if action_yardage:
    #   print(action_yardage)
    #   # End Spot
    #   df_plays.loc[idx, 'EndSpot'] = " ".join(action_yardage[0][:2])
    #   # Yardage
    #   if action_yardage[0][2] == 'no gain':
    #     df_plays.loc[idx, 'Yardage'] = 0
    #   else:
    #     df_plays.loc[idx, 'Yardage'] = action_yardage[0][2]
    # else:
    #   print("No action yardage")

    # Passer
    passer_name = re.findall(passer_name_pattern, play)
    if passer_name:
      # print(passer_name)
      df_plays.loc[idx, 'Passer'] = passer_name[0]

    # Receiver name and passing details
    receiver_name_and_passing_details = re.findall(receiver_pattern, play)
    if receiver_name_and_passing_details:
      # print(receiver_name_and_passing_details)
      df_plays.loc[idx, 'Direction'] = " ".join(receiver_name_and_passing_details[0][:2])
      df_plays.loc[idx, 'Receiver'] = receiver_name_and_passing_details[0][2]

    # Unique situation (offense spikes the ball)
    if play.find('spike') != -1:
      df_plays.loc[idx, 'Direction'] = 'spiked' # Direction?

    #############
    #  DEFENSE  #
    #############

    solo_tackle = re.findall(solo_tackle_pattern, play)
    if solo_tackle:
      if df_plays.loc[idx, 'PlayDescription'].find('pass incomplete') != -1:
        df_plays.loc[idx, 'PassDefendedBy'] = solo_tackle[0]
      else:
        df_plays.loc[idx, 'SoloTackle'] = solo_tackle[0]

    shared_tackle = re.findall(shared_tackle_pattern, play)
    if len(shared_tackle) > 0:
      if df_plays.loc[idx, 'PlayDescription'].find('pass incomplete') != -1:
        df_plays.at[idx, 'PassDefendedBy'] = shared_tackle[0]
      else:
        df_plays.at[idx, 'SharedTackle'] = shared_tackle[0]

    assisted_tackle = re.findall(assisted_tackle_pattern, play)
    if len(assisted_tackle) > 0:
      df_plays.at[idx, 'AssistedTackle'] = assisted_tackle[0][::]

    pressure_by = re.findall(defense_pressure_name_pattern, play)
    if len(pressure_by) > 0:
      df_plays.loc[idx, 'PressureBy'] = pressure_by[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury_pattern, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    # print()

  # Return once every pass play in 'df_pass_plays' has been cleaned
  return df_plays
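
The REVERSED and "reported in as eligible" blocks follow the same steps: split the description into sentences, find the sentence containing a marker, and keep only what comes after it. A small helper could hold that shared logic. This is a sketch; the name `drop_through_marker` and the sample description are hypothetical.

```python
def drop_through_marker(description, marker):
    """Split off every sentence up to and including the first one containing marker.

    Returns (removed_sentences, remaining_description). If the marker is
    absent, nothing is removed and the description comes back unchanged.
    """
    sentences = description.split(". ")
    for i, sentence in enumerate(sentences):
        if marker in sentence:
            return sentences[:i + 1], ". ".join(sentences[i + 1:])
    return [], description

# Made-up description in the style of the scraped data.
text = ("(Shotgun) J.Hurts pass incomplete. "
        "The Replay Official reviewed the ruling, and the play was REVERSED. "
        "J.Hurts pass short right to D.Smith to PHI 31 for 6 yards (H.Smith).")
reverse_details, play = drop_through_marker(text, 'REVERSED')
_, untouched = drop_through_marker(play, 'reported in as eligible')
```

Inside the cleaning loops, `reverse_details` would be stored in `'ReverseDetails'` and `play` would continue through the remaining steps.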

#### RUN PLAYS

In [107]:
# PURPOSE:
# - Clean run play types
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful run play
#                        data accessible and clean.

# NOTE:
# - This method will be used for all actions that involve running with the football.
#   (e.g. fumble recoveries for yardage, fumble recoveries for touchdown, laterals, etc..)

def clean_run_plays(df_plays, index_start = None):

  if index_start is not None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_run_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Run')]
  else:
    df_run_plays = df_plays[df_plays['PlayOutcome'].str.contains('Run')]

  # Iterating through every run play within 'df_run_plays'
  for idx, play in df_run_plays['PlayDescription'].items():

    print(idx)
    print(play)

    ################
    # Play details #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Run'

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############
    # LATERALS #
    ############
    # - Yardage gained from a lateral.. what would this look like?
    #   - Would the lateral method completely clean that play?
    #     - I think so.

    ###########
    # FUMBLES #
    ###########
    # - Yardage gained from a fumble.. what would this look like?
    #   - Would the fumble method completely clean that play?
    #     - I think so.
    if play.find('FUMBLES') != -1:
      print()
      continue

    #############
    #  OFFENSE  #
    #############

    # Rusher
    rusher_patterns = [rusher_pattern]
    # Loop through patterns and find the first match
    for pattern in rusher_patterns:
      rusher_name = re.findall(pattern, play)
      if rusher_name:
        print(rusher_name)
        df_plays.loc[idx, 'Rusher'] = rusher_name[0][0]
        # Will need to change this later
        df_plays.at[idx, 'Direction'] = " ".join(rusher_name[0][1::]).strip()
        break

    if not rusher_name:
      raise ValueError(f"rusher not found at {idx}, \"{play}\"")

    # These may have to change in the future
    # - I do not think that the value holding the end spotting will always
    #   be 'play'. In the future, this will need to be more precise.
    # - I do not think that 'start_spotting' will always be the field position.
    start_spotting = df_plays.loc[idx, 'FieldPosition']
    description_with_end_spotting = play
    df_plays.loc[idx, 'Yardage'] = yardage_between_spottings(df_plays, idx, start_spotting, description_with_end_spotting)

    # YARDAGE FOR HANDOFFS? #
    # - That's what was here in the older version. Need to keep an eye out.

    #############
    #  DEFENSE  #
    #############

    solo_tackle = re.findall(solo_tackle_pattern, play)
    if solo_tackle:
      df_plays.loc[idx, 'SoloTackle'] = solo_tackle[0]

    shared_tackle = re.findall(shared_tackle_pattern, play)
    if shared_tackle:
      df_plays.at[idx, 'SharedTackle'] = shared_tackle[0]

    assisted_tackle = re.findall(assisted_tackle_pattern, play)
    if assisted_tackle:
      df_plays.at[idx, 'AssistedTackle'] = assisted_tackle[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury_pattern, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    print()

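  # Return once every play in 'df_run_plays' has been processed.
  # - Returning after the loop (instead of on the last index) also covers the
  #   case where the final run play is skipped by a 'continue' (e.g. a fumble),
  #   which would otherwise make the method return None.
  return df_plays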
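
The REVERSED handling above can be sketched in isolation as a small helper. This is only an illustration: `split_reversed` is a hypothetical name, and the sample description is made up rather than taken from the scraped data. It uses `enumerate` for the sentence position, which sidesteps `list.index` returning the first of two identical sentences.

```python
def split_reversed(play):
    """Split a play description at the sentence that contains 'REVERSED'.

    Returns (reverse_details, remaining_description). If no sentence
    contains 'REVERSED', the description is returned unchanged.
    """
    sentences = play.split(". ")
    for pos, sentence in enumerate(sentences):
        if "REVERSED" in sentence:
            # Everything up to and including the REVERSED sentence is kept
            # as details; the rest is what still needs to be cleaned.
            return sentences[:pos + 1], ". ".join(sentences[pos + 1:])
    return [], play


# Made-up description, for illustration only
sample = ("A.Runner left end for 5 yards. The play was REVERSED. "
          "A.Runner left end to DEN 20 for 3 yards (B.Tackler).")
details, remaining = split_reversed(sample)
print(details)
print(remaining)
```

The same shape would work for the "reported in as eligible" step, with the details list simply discarded.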

#### 2PT CONVERSIONS

## 5. PIPELINE MAIN METHOD

In [None]:
# PURPOSE:
# - Accept a dataframe of nfl plays (formatted by NFL_Scrapers) and
#   return a cleaned dataframe of those plays.
# INPUT PARAMETERS:
# df_all_plays         - dataframe - all plays in raw form from NFL_Scraper that user
#                                    would like to clean.
# OUTPUT:
# df_all_plays_cleaned - dataframe - all plays from 'df_all_plays' cleaned and data
#                                    dispersed into individual new features.

# CURRENT DESIGN PLAN:
# 1. Use uniquely designed methods for each play type to clean within dataframe
#    - (e.g. pass, run, touchdown, punt, sack, ... )
# 2. Repeat until all plays within dataframe have been cleaned.
#   NOTE:
#   - It is important to fully clean one play type before moving to the next,
#     because cleaning can add a new row to the dataframe, which resets the
#     dataframe's indexing.
#     - If all play types were separated at the beginning, the indexes could
#       shift so that, for example, an index that originally pointed to a run
#       play now points to a pass play.

def clean_dataframe_of_plays(df_all_plays):

  # Return Dataframe
  df_all_plays_cleaned = df_all_plays.copy()

  ################################
  # RAW DATA COLUMN DESCRIPTIONS #
  ################################

  # Season             - Year of the season
  # Week               - Game week of the season (e.g. 'Week 1')
  # Day                - Day of the week (e.g. 'MON')
  # Date               - Month and day of the game formatted MM/DD (e.g. '09/07')
  # AwayTeam           - Visiting team of the game
  # HomeTeam           - Home team of the game
  # Quarter            - Quarter that the play is in
  #                      - NOT ACCURATE. Drives that go between quarters will end up
  #                        having all plays in the later quarter.
  # DriveNumber        - Drive number of the quarter that the play is in
  # TeamWithPossession - Team that started with the ball at the beginning of the play.
  # IsScoringDrive     - Does the drive that the focused play is in result in a score?
  # PlayNumberInDrive  - Play count in the drive
  # IsScoringPlay      - Did the play result in a score?
  # PlayOutcome        - Ultimate result of the play (e.g. '13 Yard Pass')
  # PlayStart          - The down and where the play started on the field (e.g. '2nd & 9 at DET 21')
  # PlayTimeFormation  - Time left in the quarter / quarter / play formation
  # PlayDescription    - The raw description given of the focused play, entailing everything
  #                      that happened within it.

  #############################################################
  # TRANSFORMING FEATURE VALUES (PREPPING DATA TO BE CLEANED) #
  #############################################################
  df_all_plays_cleaned = playtimeformation_split(df_all_plays_cleaned)
  df_all_plays_cleaned = playstart_split(df_all_plays_cleaned)
  df_all_plays_cleaned = consistent_team_names(df_all_plays_cleaned)

  ######################################
  # NEW ADDITIONAL COLUMN DESCRIPTIONS #
  ######################################

  # ~ General features ~
  # TimeOnTheClock     - NOT HERE ANYMORE.

  # ~ Offensive features ~
  # EndSpot            - Where the end of the play has been spotted
  #                      - This can also be where the end of the action within a play has been spotted.
  # PlayType           - The type of play (e.g. pass/run)
  # Formation          - Play formation
  # Passer             - Player that threw the ball (mostly the quarterback)
  # Rusher             - Player that ran the ball (mostly the running back)
  # Receiver           - Player on the same team as the passer that caught the ball
  # Direction          - Where the ball is going during the play
  # Yardage            - Yards gained during the play
  #                      - (Should specify that yardage does not include extra yardage gained from penalties)
  #                      - (Player awarded yardage)
  #                      - (also includes how far kicks have gone during kickoffs and punts)

  # ~ Defensive features ~
  # SoloTackle         - Player awarded a solo tackle from a play
  # AssistedTackle     - Player awarded an assisted tackle from a play
  # SharedTackle       - Player awarded a shared tackle from a play
  # PassDefendedBy     - Defender that defended the passing play
  # PressureBy         - Defender that applied pressure to the passer
  # InterceptedBy      - Defender that intercepted the passing play
  # SackedBy           - Player awarded a sack from a play. (Could be solo or split)
  # ForcedFumbleBy     - Player awarded a forced fumble from a play

  # ~ Unique features (uncommon) ~
  # WhoFumbled         - Player who last held the ball during a fumble.
  # FumbleRecoveredBy  - Player who recovered the fumbled ball
  # FumbleDetails      - A list that has what happened after the fumble
  #                      - [forced fumble by, recovered by, yards gained, tackled by]
  # ReverseDetails     - A list having plays leading up to play reversal
  # InjuredPlayers     - Players that were injured during the play
  # AcceptedPenalty    - Penalty on the field that was accepted
  # DeclinedPenalty    - Penalty on the field that was declined

  # ~ Special teams features ~
  # Kicker             - Player who kicked the ball during a kickoff / punt / extra point / field goal
  # LongSnapper        - Player who snapped the ball during a punt / extra point / field goal
  # Returner           - Player who returned the ball during a kickoff / punt
  # DownedBy           - ? ? ? I forget
  # Holder             - Player who held ball for extra point / field goal
  # BlockedBy          - Player who blocked a punt / extra point / field goal

  new_columns = ["EndSpot",
                 "PlayType", "Passer", "Rusher", "Receiver", "Direction", "Yardage",
                 "SoloTackle", "AssistedTackle", "SharedTackle", 'PassDefendedBy', "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
                 "WhoFumbled", "FumbleRecoveredBy", "FumbleDetails", "ReverseDetails", "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
                 "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

  string_columns = ["EndSpot",
                    "PlayType", "Passer", "Rusher", "Receiver", "Direction",
                    "SoloTackle", "AssistedTackle", "SharedTackle", 'PassDefendedBy', "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
                    "WhoFumbled", "FumbleRecoveredBy", "FumbleDetails", "ReverseDetails", "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
                    "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

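  # Numeric columns. These are cast to float below, which lets plays
  # without a yardage value hold NaN.
  int_columns = ["Yardage"]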

  ########################################
  # RETURN DATAFRAME WITH ADDED FEATURES #
  ########################################

  df_all_plays_cleaned = df_all_plays_cleaned.reindex(columns=df_all_plays_cleaned.columns.tolist() + new_columns)
  df_all_plays_cleaned[string_columns] = df_all_plays_cleaned[string_columns].astype(str)
  df_all_plays_cleaned[int_columns] = df_all_plays_cleaned[int_columns].astype(float)

  ########################################
  # GETTING PLAY CATEGORIES AND CLEANING #
  ########################################
  df_all_plays_cleaned = clean_pass_plays(df_all_plays_cleaned)
  df_all_plays_cleaned = clean_run_plays(df_all_plays_cleaned)



  return df_all_plays_cleaned
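
The note in the design plan about shifting indexes can be shown on a toy dataframe. This is a minimal sketch under assumptions: the four-row frame and the idea that cleaning the first pass play adds one extra row are both invented for illustration.

```python
import pandas as pd

plays = pd.DataFrame(
    {"PlayOutcome": ["5 Yard Pass", "3 Yard Run", "7 Yard Pass", "2 Yard Run"]}
)

# Run-play indexes collected up front, before any cleaning happens
run_idx_before = plays.index[plays["PlayOutcome"].str.contains("Run")].tolist()

# Assume cleaning the pass at index 0 adds one extra row, after which the
# index is reset (hypothetical behaviour, for illustration only)
extra_row = pd.DataFrame({"PlayOutcome": ["5 Yard Pass"]})
plays_after = pd.concat(
    [plays.iloc[:1], extra_row, plays.iloc[1:]]
).reset_index(drop=True)

# The stale indexes now point at pass plays
stale = plays_after.loc[run_idx_before, "PlayOutcome"].tolist()

# Re-selecting after the pass plays are cleaned gives the right rows
fresh = plays_after.index[plays_after["PlayOutcome"].str.contains("Run")].tolist()

print(run_idx_before, stale, fresh)
```

This is why each cleaning method selects its own play type from the current dataframe instead of receiving a pre-computed list of indexes.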

# TESTING

In [108]:
df_week2_plays_cleaned = clean_dataframe_of_plays(week2_2023_plays)

3
— J.Hurts scrambles right end pushed ob at PHI 37 for 7 yards (H.Phillips). MIN-J.Metellus was injured during the play.
[('J.Hurts', 'right', 'end')]

7
— D.Swift right guard to MIN 8 for 1 yard (H.Phillips; D.Wonnum).
[('D.Swift', 'right', 'guard')]

8
— J.Hurts right tackle to MIN 6 for 2 yards (J.Hicks, C.Bynum).
[('J.Hurts', 'right', 'tackle')]

19
— J.Hurts scrambles right tackle to PHI 14 for 3 yards (D.Wonnum, J.Hicks).
[('J.Hurts', 'right', 'tackle')]

21
— D.Swift left guard to PHI 34 for 7 yards (H.Smith).
[('D.Swift', 'left', 'guard')]

22
— D.Swift right guard to PHI 38 for 4 yards (C.Bynum, I.Pace).
[('D.Swift', 'right', 'guard')]

23
— D.Swift right end to 50 for 12 yards (J.Hicks).
[('D.Swift', 'right', 'end')]

28
— J.Hurts left end to MIN 39 for no gain (J.Hicks).
[('J.Hurts', 'left', 'end')]

30
— J.Hurts left tackle to MIN 37 for -3 yards (J.Hicks, D.Wonnum).
[('J.Hurts', 'left', 'tackle')]

36
— A.Mattison right guard to PHI 44 for -2 yards (J.Carter).
[('A.Mattis
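
The tuples printed in this output have the shape (rusher, direction words). The notebook's own `rusher_pattern` and `injury_pattern` are defined elsewhere, so the patterns below are hypothetical stand-ins that happen to reproduce the same shape on descriptions from this output. They are a sketch, not the patterns used by `clean_run_plays`.

```python
import re

# Hypothetical patterns (assumptions); the notebook's real patterns may differ.
PLAYER = r"[A-Z][A-Za-z]*\.[A-Z][A-Za-z'\-]+"
rusher_sketch = re.compile(
    rf"({PLAYER}) (?:scrambles )?(left|right|up the) (end|tackle|guard|middle)"
)
injury_sketch = re.compile(rf"([A-Z]{{2,3}}-{PLAYER}) was injured during the play")

play = ("J.Hurts scrambles right end pushed ob at PHI 37 for 7 yards "
        "(H.Phillips). MIN-J.Metellus was injured during the play.")

# With several capture groups, findall returns a list of tuples
rushers = rusher_sketch.findall(play)
# With a single capture group, findall returns a list of strings
injured = injury_sketch.findall(play)

print(rushers)
print(injured)
```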

In [None]:
df_week2_plays_cleaned.shape

(2752, 46)

In [None]:
df_week2_plays_cleaned

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,...,ReverseDetails,InjuredPlayers,AcceptedPenalty,DeclinedPenalty,Kicker,LongSnapper,Returner,DownedBy,Holder,BlockedBy
0,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,,,,,,,,,
1,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,,,,,,,,,
2,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,,,,,,,,,
3,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,,,,,,,,,
4,2023,Week 2,THU,09/14,MIN,PHI,1,1,PHI,1,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2747,2023,Week 2,MON,09/18,CLE,PIT,4,6,CLE,0,...,,,,,,,,,,
2748,2023,Week 2,MON,09/18,CLE,PIT,4,6,CLE,0,...,,,,,,,,,,
2749,2023,Week 2,MON,09/18,CLE,PIT,4,6,CLE,0,...,,,,,,,,,,
2750,2023,Week 2,MON,09/18,CLE,PIT,4,7,PIT,0,...,,,,,,,,,,


# PLAYTYPE OBSERVATIONS

In [None]:
# Modifying plays to match the cleaned plays' transformed features
# ( e.g. Quarter(original) = '1st Quarter'
#        Quarter(transform) = 1 )
# - This is needed in order to match plays from the original dataframe
#   to the cleaned dataframe.
df_week2_plays_modified = week2_2023_plays.copy()

df_week2_plays_modified = playtimeformation_split(df_week2_plays_modified)
df_week2_plays_modified = playstart_split(df_week2_plays_modified)
df_week2_plays_modified = consistent_team_names(df_week2_plays_modified)

## HELPER METHOD

In [None]:
# PURPOSE:
# - A tool that can be used to compare original plays and their cleaned versions

# I would like to return a map that has:
# KEY: index of original unclean play
# VALUE: index(es) of cleaned play

def unclean_to_clean_play_matches(df_unclean_plays, df_clean_plays):

  my_map = {}

  # This list of features is unique to each play
  # - Both the unclean and cleaned versions of the plays have these same features, therefore
  #   they will be used to match unclean plays in 'df_unclean_plays' to clean plays in 'df_clean_plays'
  matching_features = ['Season', 'Week', 'Date', 'AwayTeam', 'HomeTeam', 'Quarter', 'DriveNumber', 'PlayNumberInDrive']

  # Iterate through each row of the unclean plays dataframe
  for u_row in df_unclean_plays.itertuples(index=True):
    u_features = [getattr(u_row, col) for col in matching_features]

    matching_indexes = []
    matches_found = False

    # Iterate through each row of the dataframe of cleaned plays
    # - The starting index will be the index of the unclean play within the main original dataframe of plays
    #   - The matching cleaned pair will either be at the exact same location or higher
    for c_row in df_clean_plays[u_row.Index::].itertuples(index=True):
      c_features = [getattr(c_row, col) for col in matching_features]

      # If a match is found, check for consecutive rows of matches because some unclean plays needed to be cleaned using multiple rows
      # - Once a non-matching row follows a matching one, break the loop because the full match for this play has been found.
      if u_features == c_features:
        matching_indexes.append(c_row.Index)
        matches_found = True
      elif matches_found:
        break

    # Record the match after the loop so that a play whose matches run to the
    # end of 'df_clean_plays' (e.g. the final play) is not dropped.
    if matches_found:
      my_map[u_row.Index] = matching_indexes

  return my_map
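
For comparison, the same kind of map can be built with a merge instead of nested loops. This is a swapped-in technique, not the helper above: `match_by_merge` is a hypothetical name, the toy frames use a shortened key list, and unlike the loop it matches every row with equal keys rather than only consecutive rows at or after the unclean play's position.

```python
import pandas as pd


def match_by_merge(df_unclean, df_clean, keys):
    """Map each unclean index to the list of clean indexes sharing its keys."""
    # Carry each frame's index along as an ordinary column
    left = df_unclean[keys].rename_axis("unclean_idx").reset_index()
    right = df_clean[keys].rename_axis("clean_idx").reset_index()
    merged = left.merge(right, on=keys, how="inner")
    merged = merged.sort_values(["unclean_idx", "clean_idx"])
    return merged.groupby("unclean_idx")["clean_idx"].apply(list).to_dict()


# Toy data: the second unclean play was cleaned into two rows
toy_keys = ["DriveNumber", "PlayNumberInDrive"]
toy_unclean = pd.DataFrame({"DriveNumber": [1, 1, 2],
                            "PlayNumberInDrive": [1, 2, 1]})
toy_clean = pd.DataFrame({"DriveNumber": [1, 1, 1, 2],
                          "PlayNumberInDrive": [1, 2, 2, 1]})

toy_map = match_by_merge(toy_unclean, toy_clean, toy_keys)
print(toy_map)
```

On a few thousand rows either approach is workable; the merge mainly avoids rescanning the cleaned dataframe once per unclean play.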

## PASSING PLAYS

In [None]:
# All passing plays
df_unclean_pass_plays = df_week2_plays_modified.loc[df_week2_plays_modified['PlayOutcome'].str.contains('Pass')]

map_unclean_clean_pass_plays = unclean_to_clean_play_matches(df_unclean_pass_plays, df_week2_plays_cleaned)

len(map_unclean_clean_pass_plays.keys())

1045

In [None]:
# Every unclean passing play and their associated cleaned play breakdown

for i in map_unclean_clean_pass_plays.keys():
  print(f"({i}, {map_unclean_clean_pass_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(1, [1])
— J.Hurts pass short right to D.Smith to PHI 31 for 6 yards (B.Murphy; C.Bynum).

(4, [4])
— J.Hurts pass short left to D.Goedert to PHI 36 for -1 yards (C.Bynum).

(5, [5])
— J.Hurts pass deep right to D.Smith to MIN 10 for 54 yards (Th.Jackson).

(6, [6])
— J.Hurts pass short left to A.Brown to MIN 9 for 1 yard (Th.Jackson, C.Bynum).

(11, [11])
— K.Cousins pass deep right to J.Jefferson to MIN 40 for 15 yards (D.Slay).

(12, [12])
— K.Cousins pass short right to J.Jefferson to MIN 41 for 1 yard (D.Slay).

(13, [13])
— K.Cousins pass incomplete short right to A.Mattison [J.Sweat].

(14, [14])
— K.Cousins pass incomplete short middle to K.Osborn [J.Davis].

(16, [16])
— J.Hurts pass incomplete deep right to D.Goedert.

(17, [17])
— J.Hurts pass short right to D.Goedert to PHI 16 for 6 yards (H.Smith).

(25, [25])
— J.Hurts pass short left to D.Goedert to PHI 44 for 1 yard (Th.Jackson, A.Evans).

(29, [29])
— J.Hurts pass short left to A.Brown to MIN 34 for 5 yards (J.Hicks).


## RUN PLAYS

In [100]:
# All rushing plays
df_unclean_run_plays = df_week2_plays_modified.loc[df_week2_plays_modified['PlayOutcome'].str.contains('Run')]

map_unclean_clean_run_plays = unclean_to_clean_play_matches(df_unclean_run_plays, df_week2_plays_cleaned)

len(map_unclean_clean_run_plays.keys())

803

In [None]:
# Every unclean run play and their associated cleaned play breakdown

for i in map_unclean_clean_run_plays.keys():
  print(f"({i}, {map_unclean_clean_run_plays.get(i)})")
  play = df_week2_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

(3, [3])
— J.Hurts scrambles right end pushed ob at PHI 37 for 7 yards (H.Phillips)
MIN-J.Metellus was injured during the play.

(7, [7])
— D.Swift right guard to MIN 8 for 1 yard (H.Phillips; D.Wonnum).

(8, [8])
— J.Hurts right tackle to MIN 6 for 2 yards (J.Hicks, C.Bynum).

(19, [19])
— J.Hurts scrambles right tackle to PHI 14 for 3 yards (D.Wonnum, J.Hicks).

(21, [21])
— D.Swift left guard to PHI 34 for 7 yards (H.Smith).

(22, [22])
— D.Swift right guard to PHI 38 for 4 yards (C.Bynum, I.Pace).

(23, [23])
— D.Swift right end to 50 for 12 yards (J.Hicks).

(28, [28])
— J.Hurts left end to MIN 39 for no gain (J.Hicks).

(30, [30])
— J.Hurts left tackle to MIN 37 for -3 yards (J.Hicks, D.Wonnum).

(36, [36])
— A.Mattison right guard to PHI 44 for -2 yards (J.Carter).

(39, [39])
— A.Mattison left end to PHI 12 for 5 yards (A.Maddox)
PHI-A.Maddox was injured during the play
He is Out.

(44, [44])
— D.Swift right guard to PHI 32 for 7 yards (I.Pace).

(45, [45])
— D.Swift right guar

## INDEX SEARCHING

In [103]:
df_week2_plays_cleaned.iloc[115]
# df_week2_plays_cleaned.iloc[2745]

Unnamed: 0,115
Season,2023
Week,Week 2
Day,THU
Date,09/14
AwayTeam,MIN
HomeTeam,PHI
Quarter,3
DriveNumber,1
TeamWithPossession,PHI
IsScoringDrive,0
