<a href="https://colab.research.google.com/github/KeoniM/NFL_Data_Cleaning/blob/main/NFL_Plays_Week1_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**PURPOSE:**
- Accurately clean a week's worth of play data
  - Season 2023 -> Week 1

**THOUGHTS, CONCERNS AND IDEAS FOR LATER:**

*General*

1. Players with the same name
  - I do think that the raw data has naming conventions to decipher between two players with the exact same name but not 100% sure.
2. Cleaning check (TESTING)
  - I need some type of method that will help decern whether these plays have been cleaned correctly. Currently I am manually checking but this is not sustainable or efficient.
    - **IDEA:** Cross reference recorded NFL stats with stats here and compare likeness. (maybe return a df that highlights differences?)
3. Adjust features (PlayOutcomes/PlayTypes/IsScoringDrive/etc...) for plays that have been split up into multiple rows (Fumble Recoveries, Interceptions, etc...).
  - EXAMPLE: Running back fumbles on a run play but recovers it and rushes for x yards.
    - This would still count towards his rushing yards.
    - 'PlayType' = 'Run'
    - 'PlayOutcome' = 'X Yard Run'
      - 2 rows will be present for this type of play. 1 before fumble and 1 after fumble. Each will have their own separate 'PlayOutcome'..?
  - EXAMPLE: Any fumble recovery that is not the runningback on an intended running play
    - This would not count as rushing yards for the player who recovered the fumble.
    - 'PlayType' = 'Fumble Return'..?
    - 'PlayOutcome' = 'X Yard Fumble Return'..?
  - EXAMPLE: If a team throws an interception and that interception results in a touchdown for the opposing team, I do not think it should be considered as a 'scoring drive' for the team that threw the interception.
    - IDEA: For the category "isScoringDrive" the categories could be:
      1. 0 - Is not a scoring drive
      2. 1 - Is scoring drive for team on offense
      3. 2 - Is scoring drive for team on defense
  - When a play is split up into multiple rows, should each row have the starting formation of the play or should the initial starting row of the play have the formation?
  - IDEA: Should I broaden 'playtypes' to include:
    1. yardage after fumble (Currently have it as 'Run' playtype)
    2. yardage after interception (Currently have it as 'Interception')
4. Condense features.
  - For plays such as punt or kickoff, maybe I can group together data such as who is the longsnapper, holder and kicker instead of representing them on their own.
5. Condense regular expressions to grab multiple pieces of wanted data instead of individual.
6. Use 'Fuzzywuzzy' to find like play outcomes.
  - This will give me a chance to automate play types instead of eying them and seaparating them manually.
    - Not sure if I will actually need this?
7. Map team name with their abbreviations ( e.g. "Cowboys" <-> "DAL" )
  - Maybe with larger datasets with multiple weeks, I can map team names with team abbrevations that match up the most.
8. Shorten cleaning methods by creating a helper method to grab data from the defense on a play
9. Add features to break down penalty plays. This might be beneficial to get more detailed on.
10. Punt and kickoff returns are practically identical. Try to find a regular expression that will capture them both AND catch touchdown plays too.
11. I am realizing that NFL.com does not have every single play within a game. I am missing an extra point after a touchdown.
12. When a player gets sacked, there is no way to determine what type of play the offense was going for (pass/run) based off of the play description.
13. If a drive ends with an interception for a touchdown, every play within the drive will say that 'TeamWithPossession' is the other team.
14. Not all plays are recorded. There are some plays missing.
15. Make sure to add a feature descriptions for raw untouched data

*Offense*

1. Trick plays
  - Need a larger sample size that contains more trick plays
2. Latterals
  - Need a larger sample size that contains more latterals
    - (Only one has been found within the dataset "Season 2023 Week 1", it was handled for that specific play type but have not implement for all)
      - IDEA: Make a new helper method to handle "Handoff" plays.
        - Should I make a new feature for handoffs? Like a feature that links one action to another? Would that be valuable?

*Defense*

1. Nuance of players recorded for sacks & forced fumbles
  - Look under sack play type cleaning method
    - The formatting of multiple defending players in on a fumbled play may cause wrong recording of data (e.i. player who assisted in tackle may be credited for the forced fumble)

2. DEFENSIVE STATS ARE CURRENTLY WRONG
  - I will work on 0.5 tackes, solo tackles and assists. I need to adjust cleaning methods to collect this data better.
    - ';' means solo and assisted tackle
    - ',' means 0.5 tackle
  - Need to figure out when players a noticed for good coverage? Assisting in an interception (I think this is what the play descriptions are stating?)
3. Safety
  - I have not come across safeties yet.

# MOUNTING AND IMPORTS

In [1]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Used to access personal google cloud services
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [3]:
# Imports

# Data manipulation
import pandas as pd

# Regular expressions
import re

# Grab data from database
from google.cloud import bigquery

In [4]:
# # debugger (maybe use in the future)
# %pdb on

# LOADING DATA (BigQuery queries)

In [5]:
# Client connect to bigquery project
client = bigquery.Client('nfl-data-430702')

## Season 2023 Week 1

In [6]:
# Grabbing all plays from 2023 Week 1 NFL Sesason
week1_2023_plays_query = """
                         SELECT *
                         FROM `nfl-data-430702.NFL_Scores.NFL-Plays-Week1_2023`
                         """

# Running psuedo query, and returns the amount of bytes it will take to run query
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query = client.query(week1_2023_plays_query, job_config=dry_run_config)
print("This query will process {} bytes.".format(dry_run_query.total_bytes_processed))

# Running query (Being mindful of the amount of data being grabbed)
# Will grab a maximum of a Gigabyte
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
safe_config_query = client.query(week1_2023_plays_query, job_config=safe_config)

This query will process 570194 bytes.


In [7]:
# Putting data attained from query into a dataframe
week1_2023_plays = safe_config_query.to_dataframe()

In [8]:
week1_2023_plays.head()

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,PlayNumberInDrive,IsScoringPlay,PlayOutcome,PlayDescription,PlayStart
0,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,1,0,Kickoff,G.Zuerlein kicks 65 yards from NYJ 35 to end z...,Kickoff from NYJ 35
1,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,2,0,7 Yard Pass,(15:00) (Shotgun) J.Allen pass short right to ...,1st & 10 at BUF 25
2,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,3,0,5 Yard Pass,"(14:34) (No Huddle, Shotgun) J.Allen pass shor...",2nd & 3 at BUF 32
3,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,4,0,3 Yard Run,(14:01) J.Cook up the middle to BUF 40 for 3 y...,1st & 10 at BUF 37
4,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,5,0,2 Yard Run,(13:24) (Shotgun) J.Cook up the middle to BUF ...,2nd & 7 at BUF 40


In [9]:
# Noting the original size of the raw uncleaned dataframe of data
# - (rows, columns)
week1_2023_plays.shape

(2600, 15)

# CATEGORIZE PLAYS
- The goal here is to parse out the different values for 'PlayOutcome'
  - This is where I will separate different types of plays
    - ( pass / run / kickoff / etc. )

In [10]:
# Maybe try to fuzzywuzzy this in the future?
# - I need to narrow these down into basic categories.
# - (Take away numbers & "Yard")
# - Find the most common words between all outcomes (hoping to get all categories e.i. 'Pass', 'Run', 'Touchdown', etc...)

# All play outcomes from the game
# - From here we can categorize and clean plays accordingly
week1_2023_plays['PlayOutcome'].unique()

array(['Kickoff', '7 Yard Pass', '5 Yard Pass', '3 Yard Run',
       '2 Yard Run', 'Pass Incomplete', 'Punt', '-5 Yard Penalty',
       '5 Yard Run', '1 Yard Pass', '14 Yard Run', '3 Yard Pass',
       '8 Yard Run', '6 Yard Pass', '15 Yard Pass', '-9 Yard Sack',
       '4 Yard Pass', '13 Yard Pass', 'Field Goal', '-2 Yard Sack',
       'Interception', '-5 Yard Run', '18 Yard Pass', '8 Yard Pass',
       '6 Yard Run', '12 Yard Run', '-1 Yard Run', '26 Yard Pass',
       'Touchdown Bills', 'Extra Point Good', '13 Yard Run',
       '-3 Yard Sack', '7 Yard Run', '9 Yard Pass', '4 Yard Run',
       'Fumble', '-10 Yard Penalty', '10 Yard Pass', '26 Yard Run',
       '5 Yard Penalty', '-10 Yard Sack', '22 Yard Pass', '-4 Yard Run',
       '-12 Yard Sack', '83 Yard Run', '1 Yard Run', '2 Yard Pass',
       '10 Yard Run', 'Run for No Gain', '12 Yard Pass', '20 Yard Pass',
       '9 Yard Run', '-2 Yard Pass', 'Sack', '24 Yard Pass',
       '14 Yard Pass', 'Touchdown Jets', '-3 Yard Run', '-2 Yar

In [11]:
# NOTES:
# - Currently, I am eyeing at all unique play outcomes to categorizing them.
#   - This type of approach is not flexable because a play outcome can
#     arise that has not been seen yet.
#     - There may be more play outcomes in the future when working on a full season,
#       let alone all seasons and future games

# Play Types with complete cleaning methods (As far as this sample size goes)

# ~ OFFENSE ~
df_2023_pass_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Pass')]
df_2023_run_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Run')]
# ~ DEFENSE ~
df_2023_interception_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Interception')]
df_2023_sack_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Sack')]
# ~ SPECIAL TEAMS ~
df_2023_punt_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Punt')]
df_2023_kickoff_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Kickoff')]
# ~ SCORING ~
df_2023_touchdown_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Touchdown')]
df_2023_extrapoint_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Extra Point')]
df_2023_fieldgoal_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Field Goal')]
df_2023_2pt_conversion_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('2PT Conversion')]
# ~ OTHER ~
df_2023_fumble_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Fumble')]
df_2023_penalty_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Penalty')]
df_2023_turnover_on_downs_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Turnover on Downs')]


## SANITY CHECK (All Plays Accounted for)
  - Once all plays have been categorizing, will compare the sum of all plays within each category to the size of the original dataframe of plays.
    - Goal is to make sure the number of plays is the same.

In [12]:
# Categorized plays

plays_list = [df_2023_pass_week1,         # Offense
              df_2023_run_week1,
              df_2023_interception_week1, # Defense
              df_2023_sack_week1,
              df_2023_punt_week1,         # Special Teams
              df_2023_kickoff_week1,
              df_2023_touchdown_week1,    # Scoring
              df_2023_extrapoint_week1,
              df_2023_fieldgoal_week1,
              df_2023_2pt_conversion_week1,
              df_2023_fumble_week1,       # Other
              df_2023_penalty_week1,
              df_2023_turnover_on_downs_week1]

num_plays_categorized = 0

for plays in plays_list:
  num_plays_categorized = num_plays_categorized + len(plays)

num_plays_categorized == len(week1_2023_plays)

True

# HELPER METHODS (personal use)
- For personal use, does not actually take part in cleaning dataset at all.

In [13]:
# PURPOSE:
# - Quick look at a section of plays
#   - Ideally the plays that the user wants to break down and clean.
# INPUT PARAMETERS:
# df_all_plays      - DataFrame - The original dataframe where the desired plays to view came from
# df_section_plays  - DataFrame - A section of the original dataframe the user wants to view
# RETURN:
# - Printing to the console:
#   1. index of play
#   2. 'PlayDescription' feature of play
#   3. 'PlayOutcome' feature of play
def print_plays(df_all_plays, df_section_plays):
  for idx, value in df_section_plays['PlayOutcome'].items():
    play = df_all_plays['PlayDescription'].iloc[idx]
    print("index:" + str(idx))
    for i in play.split(". "):
      print(i)
    print(value)
    print()

In [14]:
# EXAMPLE: Displaying all touchdown plays within dataset

print_plays(week1_2023_plays, df_2023_touchdown_week1)

index:33
(4:51) (Shotgun) J.Allen pass short right to S.Diggs for 5 yards, TOUCHDOWN.
Touchdown Bills

index:134
(4:58) Z.Wilson pass short left to G.Wilson for 3 yards, TOUCHDOWN.
Touchdown Jets

index:152
(9:21) S.Martin punts 42 yards to NYJ 35, Center-R.Ferguson
X.Gipson for 65 yards, TOUCHDOWN.
Touchdown Jets

index:163
(6:14) (Shotgun) J.Love pass short middle to R.Doubs for 8 yards, TOUCHDOWN.
Touchdown Packers

index:197
(10:23) (Shotgun) A.Jones right guard for 1 yard, TOUCHDOWN.
Touchdown Packers

index:202
(6:34) (Shotgun) J.Love pass short middle to A.Jones for 35 yards, TOUCHDOWN
GB-A.Jones was injured during the play
His return is Questionable.
Touchdown Packers

index:214
(13:34) J.Love pass short left to R.Doubs for 4 yards, TOUCHDOWN.
Touchdown Packers

index:219
(12:53) (Shotgun) J.Fields pass short middle intended for D.Mooney INTERCEPTED by Q.Walker [K.Clark] at CHI 37
Q.Walker for 37 yards, TOUCHDOWN
PENALTY on GB-R.Douglas, Unsportsmanlike Conduct, 15 yards, enfor

# PIPELINE
  - ORDER
    1. Team Dictionary
      - Used to map team names with thier acronyms
    2. Regular expressions
      - Used to find common patterns within raw data
    3. Cleaning methods
      - Unique cleaning methods for each play type
    4. Main pipeline method
      - Control flow of cleaning methods



## 1. TEAM DICTIONARY

In [15]:
dict_teams = {
    'Cardinals': 'ARI', 'Falcons': 'ATL', 'Ravens': 'BAL', 'Bills': 'BUF', 'Panthers': 'CAR', 'Bears': 'CHI',
    'Bengals': 'CIN', 'Browns': 'CLE', 'Cowboys': 'DAL', 'Broncos': 'DEN', 'Lions': 'DET', 'Packers': 'GB',
    'Texans': 'HOU', 'Colts': 'IND', 'Jaguars': 'JAX', 'Chiefs': 'KC', 'Raiders': 'LV', 'Chargers': 'LAC',
    'Rams': 'LAR', 'Dolphins': 'MIA', 'Vikings': 'MIN', 'Patriots': 'NE', 'Saints': 'NO', 'Giants': 'NYG',
    'Jets': 'NYJ', 'Eagles': 'PHI', 'Steelers': 'PIT', '49ers': 'SF', 'Seahawks': 'SEA', 'Buccaneers': 'TB',
    'Titans': 'TEN', 'Commanders': 'WAS'
}

In [16]:
dict_teams_2 = {
    'ARI': 'Cardinals', 'ATL': 'Falcons', 'BAL': 'Ravens', 'BUF': 'Bills', 'CAR': 'Panthers', 'CHI': 'Bears',
    'CIN': 'Bengals', 'CLE': 'Browns', 'DAL': 'Cowboys', 'DEN': 'Broncos', 'DET': 'Lions', 'GB': 'Packers',
    'HOU': 'Texans', 'IND': 'Colts', 'JAX': 'Jaguars', 'KC': 'Chiefs', 'LV': 'Raiders', 'LAC': 'Chargers',
    'LAR': 'Rams', 'MIA': 'Dolphins', 'MIN': 'Vikings', 'NE': 'Patriots', 'NO': 'Saints', 'NYG': 'Giants',
    'NYJ': 'Jets', 'PHI': 'Eagles', 'PIT': 'Steelers', 'SF': '49ers', 'SEA': 'Seahawks', 'TB': 'Buccaneers',
    'TEN': 'Titans', 'WAS': 'Commanders'
}

## 2. REGULAR EXPRESSIONS

In [17]:
####################################################
# REGULAR EXPRESSIONS USED TO LOCATE SPECIFIC DATA #
####################################################

# Will eventually have to combine some regular expressions into one
# - For example, punt returns <-> kick returns <-> interceptions <-> fumble recoveries (?)

###########
# GENERAL #
###########

# Players name (Grabs every variation come across so far)
name_pattern = "(?:[A-Za-z]+-)*[A-Za-z]+\.[A-Za-z]+(?:-[A-Za-z]+)*"

# Player team (Grabs team of player)
team_name_pattern = "([A-Za-z]+)-(?:[A-Za-z]+-)*[A-Za-z]+\.[A-Za-z]+(?:-[A-Za-z]+)*"

################
# PLAY DETAILS #
################

# Play start time
time_on_clock_pattern = r'\((\d*:\d+)\)'

# Offense play formation
formation = r'\(([A-Za-z]+ ?[A-Za-z]*,? ?[A-Za-z]*)\)'

# Yards gained on play
yardage_gained = r'for (-?[0-9]+) yards?'

# Positioning of the start of the play
play_start_pattern = "(?:1st|2nd|3rd|4th) & [0-9]+ at ([A-Z]+) ([0-9]+)"

# Positioning at the end of the play
# to GB 35 for 11 yards
# (2:31) (Shotgun) E.Ezukanma left end pushed ob at LAC 27 for 7 yards (A.Gilman).
play_end_pattern = "(?:to|at) (?:([A-Z]+) )?([0-9]+) for (-?[0-9]+) yards?"

# Yardage from penalty
# , 15 yards, enforced at NYG 38.
penalty_yardage_pattern = ", ([0-9]+) yards?, enforced at (?:([A-Z]+) )?([0-9]+)"

###########
# OFFENSE #
###########

# Passer (Player passing, Player spiking, Player who got sacked)
passer_name_pattern = f"({name_pattern}) (?:pass|spiked|sacked)"

# Rushing play (Player running ball)
rusher_pattern = f"({name_pattern})(?: scrambles)? (?:left|right|up|kneels).?"

# Pass play (Returns intended receiver and the direction of the pass)
receiver_pattern = f"(short|deep) (left|right|middle) (?:to|intended for) ({name_pattern})"

# 2 Point Conversion (Pass attempt)
tp_conversion_pass_pattern = f"({name_pattern}) pass to ({name_pattern})"

# 2 Point Conversion (Rush attempt)
tp_conversion_rush_pattern = f"({name_pattern}) rushes (?:left|right|up)"

# Handoff
handoff_pattern = f"Handoff to ({name_pattern}) to(?: [A-Z]+)? [0-9]+ for -?[0-9]+ yards?"

###########
# DEFENSE #
###########

# Tackles (solo, assist, shared) <-- the goal. Right now all I have is tackle1 and tackle2

# Main defender on play (Used to grab tackler1 and used to grab players that sacked the passer)
defense_tackler_1_name_pattern = f"\(({name_pattern})"

# Second defender on play (Used to grab tackler2)
defense_tackler_2_name_pattern = f" ({name_pattern})\)" # Will have a ")" at the end of the name



solo_tackle_pattern = f"\(({name_pattern})\)"

shared_tackle_pattern = f"\(({name_pattern}), ({name_pattern})\)"

assisted_tackle_pattern = f"\(({name_pattern}); ({name_pattern})\)"



# Pressure (Who applied pressure to passer)
# - I think it might be possible for multiple defenders to apply pressure to the passer.
defense_pressure_name_pattern = f"\[({name_pattern})\]"

# Interception (Player who intercepted pass)
interception_name_pattern = f"INTERCEPTED by ({name_pattern})"

# Quarterback Fumbles (Quarterback fumble solo, Quarterback fumble solo -> who recovers, Quarterback <-> Center discrepancy)

# How far passer went before fumbling on his own
qb_fumble_pattern = f" ({name_pattern}) to(?: [A-Z]+) [0-9]+ for -?[0-9]+ yards$" # Passer fumbles are always the initial action of the play

# Action directly after a quarterback only fumble
qb_fumble_description_pattern = f"^FUMBLES, "

# Fumble missnap (Will either be the quarterback or center.)
aborted_fumble_pattern = f"({name_pattern}) FUMBLES"

# Forced fumbles (Player who forced the fumble)
forced_fumble_pattern = f"FUMBLES \(({name_pattern})\)"

# Sack (Who is credited with a sack, who split sack, how many yards was the sack)

# Fumble from sack (Player who forced the fumble on a sack)
sacked_forced_fumble_sentence = f"FUMBLES \({name_pattern}\) \[({name_pattern})\]"

# Split sack (Players who equally received credit for sack)
split_sack_pattern = f"sack split by ({name_pattern}) and ({name_pattern})"

# Yardage of sack (starting from line of scrimmage)
yardage_from_sack = r'sacked(?: ob)? at(?: [A-Z]+)? [0-9]+ for (-?[0-9]+) yards'

# Defense takeaway (takeaway for yardage)
defensive_takeaway_run_pattern = f"^({name_pattern}) (?:pushed ob at|ran ob at|to)(?: [A-Z]+) -?[0-9]+ for " # yardage after fumble recovery & yardage after interception

# Defense takeaway (takeaway for touchdown)
touchdown_after_takeaway_pattern = f"({name_pattern}) for [0-9]+ yards, TOUCHDOWN" # touchdown after a fumble recovery or interception

#################
# SPECIAL TEAMS #
#################

# Punting play (Who was the punter, How many yards the ball went, Who was the Longsnapper)
punting_pattern = f"({name_pattern}) punts (-?[0-9]+) yards? to(?: [A-Z]+ -?[0-9]+| -?[0-9]+| end zone), Center-({name_pattern})"

# Punt return (Who was returning the punt, How many yards did they go, The player(s) that tackled the returner)
# punt_return_pattern = f"({name_pattern}) (?:pushed ob at|ran ob at|to)(?: [A-Z]+)? [0-9]+ for (-?[0-9]+) yards? \(({name_pattern})(?:(?:,|;) ({name_pattern}))?\)" # yardage after punt
punt_return_pattern = f"({name_pattern}) (?:pushed ob at|ran ob at|to)(?: [A-Z]+)? [0-9]+ for"

# J.Reed (didn't try to advance) to CHI 44 for no gain.
kick_return_pattern = f"({name_pattern})(?: \(didn't try to advance\))? (?:pushed ob at|ran ob at|to)(?: [A-Z]+)? [0-9]+ for (no gain|(-?[0-9]+) yards? \(({name_pattern})(?:(?:,|;) ({name_pattern}))?\))" # yardage after kickoff

# Punt return resulting in fair catch
punt_fair_catch_pattern = f", fair catch by ({name_pattern})"

# Punt or kickoff downed by
kick_downed_by_pattern = f", downed by ({name_pattern})"

# Kickoff play (Who was the kicker, How many yards the ball was kicked )
kickoff_pattern = f"({name_pattern}) kicks(?: onside)? (-?[0-9]+) yards from"

# Field goal (Good)
field_goal_good_pattern = f"({name_pattern}) (-?[0-9]+) yard field goal is GOOD, Center-({name_pattern}), Holder-({name_pattern})."

# Field goal (no good)
field_goal_no_good_pattern = f"({name_pattern}) (-?[0-9]+) yard field goal is No Good, ([A-Za-z]+(?: [A-Za-z]+)*), Center-({name_pattern}), Holder-({name_pattern})."

# Field goal (blocked)
field_goal_blocked_pattern = f"({name_pattern}) (-?[0-9]+) yard field goal is BLOCKED \(({name_pattern})\), Center-({name_pattern}), Holder-({name_pattern}), RECOVERED by ({name_pattern})"

# Extra point (good)
extra_point_good_pattern = f"({name_pattern}) extra point is GOOD, Center-({name_pattern}), Holder-({name_pattern})."

# Extra point (no good)
extra_point_no_good_pattern = f"({name_pattern}) extra point is No Good, ([A-Za-z]+(?: [A-Za-z]+)*), Center-({name_pattern}), Holder-({name_pattern})."

##############
#  INJURIES  #
##############

# Injuries (Returns the player(s) who go injuried during play)
# injury = f"[A-Z]+-({name_pattern}) was injured during the play"
injury_pattern = f"[A-Z]+-({name_pattern}) was injured during the play"

## PREPROCESSING DATA

In [18]:
# Value mapping the "Quarter" feature
week1_2023_plays['Quarter'].unique()

array(['1ST QUARTER', '2ND QUARTER', '3RD QUARTER', '4TH QUARTER',
       'OVERTIME'], dtype=object)

In [19]:
week1_2023_plays_modified = week1_2023_plays.copy()

dict_replace_quarter = {'1ST QUARTER': 1, '2ND QUARTER': 2, '3RD QUARTER': 3, '4TH QUARTER': 4, 'OVERTIME': 5}

week1_2023_plays_modified['Quarter'] = week1_2023_plays_modified['Quarter'].map(dict_replace_quarter)

## 3. CLEANING METHODS

###HELPER CLEANING METHODS

#### helper method for fumbles

In [73]:
# PURPOSE:
# - Universal helper method that extracts fumbled data from every playtype.

# BASIC PLAN:
# 1. Accept a single row of a play that has been fumbled from the main dataframe of plays.
# 2. Replace that single row with a dataframe containing all extracted data.
#    - These replacement dataframes are not limited to a single row but can be many, depending on the play.

# BASIC DESIGN STEP BY STEP:
# 1. Split play description into significant actions and put into a list
#    EXAMPLES:
#    - intended play
#    - fumble recovery for yardage
# 2. Clean significant actions as their own rows
#    EXAMPLE METHODS USED TO CLEAN:
#    - main cleaning method (method used to clean a playtype that is using this helper method)
#    - run playtype cleaning method (Will be used to clean all fumble recoveries for yardage)
# 3. Create and return replacement dataframe containing all cleaned significant actions (or rows)

# INPUT PARAMETERS:
# df_plays                  - dataframe - dataframe of plays
# play                      -  String   - 'PlayDescription' of the current play that is being cleaned
# play_index                -  Integer  - index of play (Almost always from main dataframe of plays)
# main_action_patterns      -    list   - A list of regular expressions that are meant to pinpoint primary
#                                         actions within a play that will be used to extract these actions
#                                         to create a row within the replacement dataframe
# main_cleaning_method      - function  - A callback function (the function using this helper method) which
#                                         is used to clean intended play actions

# RETURN:
# df_multi_row_play - dataframe - dataframe of organized and cleaned actions stemming from a single unclean fumbled play

# NOTE: I need to comment effectively, grabbing all the nuances of what is being grabbed
#       for each playtype. All playtypes are different and need to be described.

# CONCERNS:
# 1. Nuance on sacked plays
#    - Formatting of defender who caused sack is different from a solo and an assisted
# 2. Who is at fault for aborted plays
#    - Formatting on aborted plays is different if the fault lands on the center or passer
# 3. May have to add the parameter "secondary_action_patterns"
#    - I just ran into the issue of a kickoff return fumble.
#      - In this case there is 1. the kickoff 2. the kickoff return 3. the fumble from kickoff return.

def extract_fumble_data(df_plays, play, play_index, main_action_patterns, main_cleaning_method):

  original_play_copy = df_plays.loc[play_index]

  # Breaking play description into a list of sentences
  play_elements = play.split(". ")

  #################
  # KEY VARIABLES #
  #################

  # 'play_split' info:
  # - Designed to be a 2D list (list of lists)
  # - All elements within this list together will represent a single play.
  # - Each element within the list will become a separate row that will replace/add to the original dataframe of plays.
  #   - Each element represents a distict action within the single play and will have all data required for that new row.
  #   ROW CONTENTS:
  #   1. [ ( The intended play ) + ( Extra data ) , ( Who caused the fumble ) ]   <-  This row will have extra info such as (injuries / penalties / eligibility / etc...)
  #                                                                                   - "The intended play" includes 'Aborted' plays
  # ~ 2. [            ( The fumble recovery )     , ( Who caused the fumble ) ]   <-  This can happen repeatedly or not at all
  # ~ 3. [ (The fumble recovery for a touchdown) ]                                <-  This can only happen once for a single play or not at all
  play_split = []

  # 'extra_data' info:
  # - Will be a single string containing all additional data from the play such as (injuries / penalties / eligibility / etc...)
  # - Will be put into a single row dataframe and cleaned
  #   - Once extra data has been cleaned, the single row (now clean) dataframe will serve as a shell for
  #     the first new row that will replace the old play within the main dataframe.
  #     - This first new row will have the initial action of the play as well as all additional information from the play
  extra_data = ""

  # - Iterate through each element within play_elements
  # - NOTE: We are iterating through actions of the play cronologically
  for string in play_elements:

    ######################################
    # ORGANIZING KEY ACTIONS WITHIN PLAY #
    ######################################

    # ACTIONS WITHIN PLAY THAT DESERVE THEIR OWN ROW:
    # These situations will have their own list element within "play_split" (meaning their own row within the new cleaned replacement dataframe)
    # 1. intended play (initial action might be a better name for plays such as ones that have been aborted)
    #    RUN PLAYS:
    #     - Fumbles after inteded run play
    #     - Aborted fumbles
    #     - qb only fumbles
    #    SACKED PLAYS:
    #     - fumbles after sack
    #    PASSING PLAYS:
    #     - Fumbles after intended pass play
    #     - qb only fumbles
    #    KICKOFF PLAYS:
    #     - Fumbles happen during kickoff return
    # 2. runs after fumble recoveries (emphasis on the plural)
    # 3. touchdown after fumble recovery (can only happen once) (looks unique for each playtype) <- this might not be true.
    #    ! ! ! ATTENTION ! ! !
    #    - I have a small sample size for this.
    #    - This is one thing that I need to double check correctness on later in the future when having a larger sample size.
    #    RUNS PLAYS:
    #    - Are fumble recovery touchdown from run plays accounted for?
    #    SACKED PLAYS
    #    - touchdown after a sacked play
    #    PASSING PLAYS:
    #    - Are fumble recovery touchdown from passing plays accounted for?
    #
    #    - Are all fumble recoveries the same? wouldn't they all be rushing playtypes?
    # 4. handoffs
    for play_pattern in main_action_patterns:
      if re.search(play_pattern, string) != None:
        play_split.append([string])
        break
    if re.search(play_pattern, string) != None:
      continue

    # ADD ON SECTION (Actions that will add to elements that will obtain their own row)
    # - Appends data to elements within 'play_split'
    #   - Every element within play_split is a list, this section will add to those individual lists
    #     - Specifically it will append to the last element within 'play_split' and the reason for that
    #       is because as we are iterating through sentences cronologically, the appending element
    #       will always follow directly after the element that needs it
    # These situations will add to the last element within 'play_split' (For all playtypes)
    # 1. forced fumble description (happens after regular plays & sometimes after fumble recoveries)
    # 2. fumble description describing a qb only fumble (happens after a qb only fumble)
    for play_pattern in [forced_fumble_pattern, qb_fumble_description_pattern]:
      if re.search(play_pattern, string):
        index_last_element = len(play_split) - 1
        play_split[index_last_element] = [play_split[index_last_element][0], string]
        break
    if re.search(play_pattern, string) != None:
      continue

    # When a sentence does not fit within the top 2 sections ( 1. adding an element to the list || 2. appending to an element in the list )
    # - Glue the sentence into 'extra_data' to be cleaned separately.
    extra_data = extra_data + string + ". "

  ################################
  # CLEANING ACTIONS WITHIN PLAY #
  ################################

  # GRABBING: Initial action of play (e.g. Intended play / aborted fumble / qb only fumble / etc...)
  intended_play_description = play_split.pop(0)

  # Creating a single row dataframe of the original play
  unclean_original_play_copy = pd.DataFrame([original_play_copy.copy()], columns=df_plays.columns)

  # CREATING SHELL FOR: Initial action of play
  # - shell is only necessary with plays that have extra data (injuries / penalties / eligibility / etc...)
  # - extra data will only be available within the first row of the replacement dataframe
  if extra_data:
    unclean_original_play_copy['PlayDescription'] = extra_data
    unclean_original_play_copy = main_cleaning_method(unclean_original_play_copy)

  # CLEANING: Initial action of play
  # No matter what the initial action is, the description will always be the first element of the first element within 'play_split'
  unclean_original_play_copy['PlayDescription'] = intended_play_description[0]

  # May have to adjust in the future.
  # - ON SACKED PLAYS, there is nuance on the formatting of a player who caused a sack and a forced fumble.
  #   - Sometimes it'll look something like this "FUMBLES (B.Burns) [B.Burns]" <- [B.Burns] is credited with the forced fumble
  #   - less often it'll look like "FUMBLES (B.Burns)" <- B.Burns is credited with the forced fumble.
  # - ON ABORTED PLAYS, there is nuace on the formatting of a player who caused the play to be aborted.
  #   - the word "Aborted" will either be in parenthesis or without, this signals whether the center was at fault or the passer.
  #     - Need to figure out how to record this data.
  # - ON KICKOFF PLAYS
  #   - Because there is the kickoff, then the kickoff return, then the fumble on the kickoff return,
  #     the intended play will not have the fumble detail but still needs to be cleaned.

  # intended play / qb only fumble
  if len(intended_play_description) > 1:
    unclean_original_play_copy['FumbleDetails'] = intended_play_description[1]
    forced_fumble = re.findall(forced_fumble_pattern, intended_play_description[1])
    if len(forced_fumble) > 0:
      unclean_original_play_copy['ForcedFumbleBy'] = forced_fumble[0]
    cleaned_original_play_copy = main_cleaning_method(unclean_original_play_copy)

  # kickoff (fumble occurs after kickoff return)
  kickoff = re.findall(kickoff_pattern, intended_play_description[0])
  if len(kickoff) > 0:
    cleaned_original_play_copy = main_cleaning_method(unclean_original_play_copy)

  # blocked field goal recovery (fumble would occur during the recovery run) <-------
  # - I think here I will need to:
  #   1. separate 'PlayDescription' by ", "
  #   2. Remove the section that states who recovered the blocked field goal attempt
  #      - Hold to add back into 'PlayDescription' after rest has been cleaned.
  #      - This is important because if I were to send it to be cleaned along with
  #        the rest of the string, it would cause an infinite loop.
  #   3. clean the rest of the string using the main cleaning method
  field_goal_blocked = re.findall(field_goal_blocked_pattern, intended_play_description[0])
  if len(field_goal_blocked) > 0:
    field_goal_blocked_elements = intended_play_description[0].split(", ")
    for i in field_goal_blocked_elements:
      if i.lower().find('recovered') != -1:
        field_goal_blocked_elements.pop(field_goal_blocked_elements.index(i))
        print(", ".join(field_goal_blocked_elements))
        unclean_original_play_copy['PlayDescription'] = ", ".join(field_goal_blocked_elements)
        cleaned_original_play_copy = main_cleaning_method(unclean_original_play_copy)
        cleaned_original_play_copy['PlayDescription'] = f"{unclean_original_play_copy['PlayDescription'], {i}}"
        print(cleaned_original_play_copy['Yardage'].iloc[0])
        break

  # Aborted fumble (Dig more into this. I dont think this captures only ABORTED fumbles)
  # I think I have to make this more specific than dumping everything else into here.
  else:
    unclean_original_play_copy['FumbleDetails'] = intended_play_description[0]
    cleaned_original_play_copy = unclean_original_play_copy

  # FUMBLE RECOVERIES FOR YARDAGE & FUMBLE RECOVERIES FOR TOUCHDOWNS

  # Created list for the possibility of having multiple fumbles and recoveries in a single play
  list_recovery_runs = []

  for play in play_split:

    recovery_row = pd.DataFrame([original_play_copy.copy()], columns=df_plays.columns)

    # Recovery after fumble was fumbled
    if len(play) > 1:
      recovery_row['FumbleDetails'] = play[1]
      forced_fumble = re.findall(forced_fumble_pattern, play[1])
      if len(forced_fumble) > 0:
        recovery_row['ForcedFumbleBy'] = forced_fumble[0]

    recovery_row['PlayDescription'] = play[0]
    # Pass after fumble recovery
    pass_play = re.findall(passer_name_pattern, play[0])
    if len(pass_play) > 0:
      recovery_row['PlayOutcome'] = 'Pass'
      cleaned_recovery_row = clean_pass_plays(recovery_row)
    # Everything else can be labeled as a run play
    else:
      recovery_row['PlayOutcome'] = 'Run' # <-- Possibly change this in the future (could be something like 'fumble recovery run?' unless it was the rb that recovered)
      cleaned_recovery_row = clean_run_plays(recovery_row)

    cleaned_recovery_row['PlayOutcome'] = original_play_copy['PlayOutcome'] # <- Maybe this isn't correct? when a play is split by multiple rows, this becomes tricky.
    list_recovery_runs.append(cleaned_recovery_row)

  ###################
  # 3.NEW DATAFRAME #
  ###################
  # - Create the cleaned replacement row(s) for the original row.

  if len(list_recovery_runs) > 0:
    df_multi_row_play = pd.DataFrame(columns=df_plays.columns)
    df_multi_row_play = pd.concat([cleaned_original_play_copy, *list_recovery_runs], ignore_index=True)
  else:
    df_multi_row_play = cleaned_original_play_copy

  return df_multi_row_play

#### helper method for penalties

In [21]:
# I want to see how many rushing yards are awarded to the rusher when a penalty is involved.

# RUSHING PLAYS
# - If the penalty was beyond the line of scrimmage and brought back, the rusher is
#   awarded with any positive gained yards up to the spotting of the ball.
# - If the penalty was behind or at the line of scrimmage, the play does does not count.

# - I think here I should find:
#   1. Where the play started
#      - How am I supposed to know which direction the offensive is going based off of a play?
#        - Do I look at if the gain was positive? And see from marker where they started to
#          where they finished?
#        - How do I know if going higher or going lower than the starting position is positive?
#          - I need to figure out if the took place before, on or behind the line of scrimmage.
#          - I feel like it might help if I use
#            1. the starting position
#            2. yardage gained from the play and what position on the field it ended up
#               - Whether that yardage gained was positive or negative

#              if (starting position yard < ending position yard) &
#                 (play gain yardage > 0):
#                 - Positive would be
#                   - anything greater than position yard (team side)
#                   - anything on opposing team side
#              EXAMPLE:
#              starting position = GB 24
#              ending position = GB 35
#              play gain yardage = 11
#              - Positive would be:
#                - 'GB' (>24)
#                - 'Opposing team acronym' (49-1)

#             There has to be an easier way to do this.
#             - Maybe I can go based off 100 (the entire football field)
#               - So I can figure out which direction is positive and add either 0 or 50 to the
#                 yardage based off of the team acronym.
#                 - I need to figure out which direction is positive from the
#                   starting position.


#   - Figure out which team area is the starting point.
#     - Figure this out for 0 yard run too



#   2. Where the penalty was enfored
#   3. Where the ball is placed

def accepted_penalty_play_on_offense(df_plays, play, index_of_play):

  print(index_of_play)

  print(df_plays['PlayOutcome'].loc[index_of_play])


  # Ultimately, All I want is to figure out how much positive yards were gained if any
  # on an accepted offensive penalty.

  # position yardage (+) & play yardage (+):
  # - the starting position team zone is the beginning (0-50) (home territory)
  # position yardage (-) & play yardage (+):
  # - the starting position team zone is the ending (50-100) (enemy territory)
  # position yardage (+) & play yardage (-):
  # - the starting position team zone is the ending (50-100) (enemy territory)
  # position yardage (-) & play yardage (-):
  # - the starting position team zone is the beginning (0-50) (home territory)

  # Unique cases
  # zones switch (home territory -> enemy territory) (e.g. KC 47 -> BUF 47)
  # play yardage (+):
  # - the starting position team zone is the beginning (0-50)
  # play yardage (-):
  # - the starting position team zone is the ending (50-100)
  # penalty occured on the line of scrimmage
  # - doesn't matter. yardage gained is 0

  # What can I do with knowledge of which zone is 0-50 yards in a drive?
  # - Ultimately I want to be able to find how much yards are awarded to offensive
  #   players during a penalized play.
  #   - I can take the ending point, subtract the starting, then subtract the penalty.
  #     - If any yards are left over, then those yards are awarded to the rushing play (for run plays)

  # What do I need to figure out?
  # - Starting point (on 100 yard format)
  # - ending point (on 100 yard format)
  # - penalty enforcement (on 100 yard format)
  # - offensive or defensive penalty
  # - subtract penalty
  # - If resulting number is less than starting point, yardage = 0
  # - If resulting number is greater than starting point, yardage = result

  # Play start
  play_start = df_plays['PlayStart'].loc[index_of_play]
  play_start_elements = re.findall(play_start_pattern, play_start)
  if len(play_start_elements) > 0:
    print(play_start_elements)

  # Play end
  play_end_elements = re.findall(play_end_pattern, play)
  if len(play_end_elements) > 0:
    print(play_end_elements)

  # Penalty data
  penalty_elements = re.findall(penalty_yardage_pattern, play)
  print(penalty_elements)


  # I cannot forget to take care of a 0 yard gain.
  # What happens if a penalty is on the 50 with no team acronym?


  # 0-100 yard format (example 'GB 30': 30 yards OR 80 yards)
  starting_position = 0
  ending_position = 0
  penalty_enforcement_position = 0

  # Starting of play and ending of play are in the same 50 yard zone
  if play_start_elements[0][0] == play_end_elements[0][0]:
    # position yardage (+)
    if play_start_elements[0][1] < play_end_elements[0][1]:
      # play yardage (+)
      if int(play_end_elements[0][2]) > 0:
        # print(f'{play_start_elements[0][0]} zone is (0-50)')
        starting_position = int(play_start_elements[0][1])
        ending_position = int(play_end_elements[0][1])
        penalty_enforcement_position = int(penalty_elements[0][2])
      # play yardage (-)
      else:
        print(f'{play_start_elements[0][0]} zone is (50-100)')
        starting_position = 100 - int(play_start_elements[0][1])
        ending_position = 100 - int(play_end_elements[0][1])
        penalty_enforcement_position = 100 - int(penalty_elements[0][2])
    # position yardage (-)
    else:
      # play yardage (+)
      if int(play_end_elements[0][2]) > 0:
        print(f'{play_start_elements[0][0]} zone is (50-100)')
        starting_position = 100 - int(play_start_elements[0][1])
        ending_position = 100 - int(play_end_elements[0][1])
        penalty_enforcement_position = 100 - int(penalty_elements[0][2])
      # play yardage (-)
      else:
        print(f'{play_start_elements[0][0]} zone is (0-50)')
        starting_position = int(play_start_elements[0][1])
        ending_position = int(play_end_elements[0][1])
        penalty_enforcement_position = int(penalty_elements[0][2])
  else:
    # play yardage (+)
    if int(play_end_elements[0][2]) > 0:
      print(f'{play_start_elements[0][0]} zone is (0-50)')
      zero_to_fifty_zone = play_start_elements[0][0]
      starting_position = int(play_start_elements[0][1])
      ending_position = 100 - int(play_end_elements[0][1])
      if penalty_elements[0][1] == zero_to_fifty_zone:
        penalty_enforcement_position = int(penalty_elements[0][2])
      else:
        penalty_enforcement_position = 100 - int(penalty_elements[0][2])
    # play yardage (-)
    else:
      print(f'{play_start_elements[0][0]} zone is (50-100)')
      fifty_to_hundred_zone = play_start_elements[0][0]
      starting_position = 100 - int(play_start_elements[0][1])
      ending_position = int(play_end_elements[0][1])
      if penalty_elements[0][1] == fifty_to_hundred_zone:
        penalty_enforcement_position = 100 - int(penalty_elements[0][2])
      else:
        penalty_enforcement_position = int(penalty_elements[0][2])

  print(starting_position)
  print(ending_position)
  print(penalty_enforcement_position)


  resulting_position = penalty_enforcement_position - int(penalty_elements[0][0])
  print(resulting_position)


  print(resulting_position - starting_position)

  print(play)
  print()

### OFFENSE CLEANING METHODS

#### PASS PLAYS

In [22]:
# PURPOSE:
# - Clean all passing type plays within a given dataframe.
# INPUT PARAMETERS:
# df_plays    - dataframe - NFL plays (can include play types other than passing)
# index_start -  integer  - index where within the dataframe the method will start
#                           cleaning in ascending order.
# RETURN:
# df_plays - dataframe - the same input df_plays but with all passing play types cleaned

# NOTE:
# - I want this to work with slices of the main dataframe as well.
#   - Within slices, I think it is crucial to keep the original indexing from the main
#     dataframe for ease to put back into the original dataframe.

def clean_pass_plays(df_plays, index_start = None):

  # Adjusting df_plays to start cleaning at a specified index (index_start)
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    # Locating all passing type plays within dataframe
    df_pass_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Pass')]
  else:
    # Locating all passing type plays within dataframe
    df_pass_plays = df_plays[df_plays['PlayOutcome'].str.contains('Pass')]

  for idx, play in df_pass_plays['PlayDescription'].items():

    ################
    # Play details #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Pass'

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ###########
    # FUMBLES #
    ###########

    # Additional rows may be added after certain types of fumbled passing plays.
    # - The idea here is that, in those situations, the helping method 'extract_fumble_data'
    #   will return a small dataframe of the rows that the single play split into.
    #   - When this small dataframe is returned, it will replace the original play
    #     within the main dataframe of plays and then continue on cleaning the rest of the passing plays.

    if play.find('FUMBLES') != -1:
      main_action_patterns = [passer_name_pattern, qb_fumble_pattern, defensive_takeaway_run_pattern]
      main_cleaning_method = clean_pass_plays
      df_replacement_rows = extract_fumble_data(df_plays, play, idx,
                                                main_action_patterns,
                                                main_cleaning_method)

      # "df_plays.index.tolist().index(idx)" needed for method usage with slices of original dataframe.
      df_before = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
      index_of_last_added_row = idx + len(df_replacement_rows) - 1
      if df_pass_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_pass_plays(df_plays, index_of_last_added_row + 1)

    ###########
    # OFFENSE #
    ###########

    # NOTE:
    # - Incomplete passes will have 'PlayOutcome' as 'Pass Incomplete' as well
    #   as yardage value being 0.0

    # Yardage gained
    yardage = re.findall(yardage_gained, play)
    if len(yardage) > 0:
      df_plays.loc[idx, 'Yardage'] = int(yardage[0])
    else:
      df_plays.loc[idx, 'Yardage'] = 0

    # Formation
    Formation = re.findall(formation, play)
    if len(Formation) > 0:
      if Formation[0] == 'Aborted':
        pass
      else:
        df_plays.loc[idx, 'Formation'] = Formation[0]

    # Passer (What about spikes?)
    passer_name = re.findall(passer_name_pattern, play)
    if len(passer_name) > 0:
      df_plays.loc[idx, 'Passer'] = passer_name[0]

    receiver_name_and_passing_details = re.findall(receiver_pattern, play)
    if len(receiver_name_and_passing_details) > 0:
      df_plays.loc[idx, 'Direction'] = f"{receiver_name_and_passing_details[0][0]} {receiver_name_and_passing_details[0][1]}"
      df_plays.loc[idx, 'Receiver'] = receiver_name_and_passing_details[0][2]

    # Unique situation (offense spikes the ball)
    if play.find('spike') != -1:
      df_plays.loc[idx, 'Direction'] = 'spiked' # Direction?

    #############
    #  DEFENSE  #
    #############

    solo_tackle = re.findall(solo_tackle_pattern, play)
    if len(solo_tackle) > 0:
      df_plays.loc[idx, 'SoloTackle'] = solo_tackle[0]

    shared_tackle = re.findall(shared_tackle_pattern, play)
    if len(shared_tackle) > 0:
      df_plays.at[idx, 'SharedTackle'] = shared_tackle[0]

    assisted_tackle = re.findall(assisted_tackle_pattern, play)
    if len(assisted_tackle) > 0:
      df_plays.loc[idx, 'SoloTackle'] = assisted_tackle[0][0]
      df_plays.loc[idx, 'AssistedTackle'] = assisted_tackle[0][1]

    pressure_by = re.findall(defense_pressure_name_pattern, play)
    if len(pressure_by) > 0:
      df_plays.loc[idx, 'PressureBy'] = pressure_by[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury_pattern, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:
          penalties.append(i)
      df_plays.at[idx, 'AcceptedPenalty'] = penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          penalties.append(i)
      df_plays.at[idx, 'DeclinedPenalty'] = penalties

  if df_pass_plays.tail(1).index.tolist()[0] == idx:
    return df_plays

#### RUN PLAYS

In [23]:
# # PURPOSE:
# # - Clean run play types
# # INPUT PARAMETERS:
# # df_plays    - dataframe - dataframe of plays
# # index_start -  integer  - the starting index of the associated input dataframe
# #                           to begin cleaning.
# # RETURN:
# # df_plays - dataframe - dataframe of plays that now has all useful run play
# #                        data accessable and clean.

# # NOTE:
# # - Need to comment on how this is also a method being used for
# #   1. fumble recoveries for yardage
# #   2. fumble recoveries for touchdown
# # - I also have not come across a case where a rushing play has been fumbled and someone
# #   recovered the ball and scored a touchdown yet.

# def clean_run_plays(df_plays, index_start = None):

#   if index_start != None:
#     df_plays_adjusted = df_plays.loc[index_start:]
#     df_run_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Run')]
#   else:
#     df_run_plays = df_plays[df_plays['PlayOutcome'].str.contains('Run')]

#   # Iterating through every run play within 'df_run_plays'
#   for idx, play in df_run_plays['PlayDescription'].items():

#     ################
#     # Play details #
#     ################

#     # Play Type
#     df_plays.loc[idx, 'PlayType'] = 'Run'

#     # TimeOnTheClock
#     TimeOnTheClock = re.findall(time_on_clock_pattern, play)
#     if len(TimeOnTheClock) > 0:
#       df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

#     # Formation
#     Formation = re.findall(formation, play)
#     if len(Formation) > 0:
#       if Formation[0] == 'Aborted':
#         pass
#       else:
#         df_plays.loc[idx, 'Formation'] = Formation[0]

#     ############
#     # REVERSES #
#     ############

#     # In 'PlayDescription' all information before the "reversed" sentence is not needed.
#     # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
#     if play.find('REVERSED') != -1:
#       play_elements = play.split(". ")
#       for i in play_elements:
#         if i.find("REVERSED") != -1:
#           df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
#           play = ". ".join(play_elements[play_elements.index(i) + 1:])
#           break

#     ############################
#     # REPORTING IN AS ELIGIBLE #
#     ############################

#     # I do not think this contains any useful data so I am going to exclude it.
#     if play.find('reported in as eligible') != -1:
#       play_elements = play.split(". ")
#       for i in play_elements:
#         if i.find('reported in as eligible') != -1:
#           play = ". ".join(play_elements[play_elements.index(i) + 1:])
#           break

#     ###########
#     # FUMBLES #
#     ###########

#     if play.find('FUMBLES') != -1:

#       # - I think it would help to comment on each action added
#       # - Does this catch fumble recovery touchdowns?
#       main_action_patterns = [rusher_pattern, aborted_fumble_pattern, qb_fumble_pattern, defensive_takeaway_run_pattern, handoff_pattern]
#       main_cleaning_method = clean_run_plays
#       df_replacement_rows = extract_fumble_data(df_plays, play, idx,
#                                                 main_action_patterns,
#                                                 main_cleaning_method)

#       # "df_plays.index.tolist().index(idx)" needed for method usage with slices of original dataframe.
#       df_before = df_plays.iloc[:df_plays.index.tolist().index(idx)]
#       df_after = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
#       df_plays = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
#       index_of_last_added_row = idx + len(df_replacement_rows) - 1

#       # returning row after the last index
#       if df_run_plays.tail(1).index.tolist()[0] == idx:
#         return df_plays
#       else:
#         return clean_run_plays(df_plays, index_of_last_added_row + 1)

#     #############
#     #  OFFENSE  #
#     #############

#     # Rusher
#     rusher_patterns = [rusher_pattern, defensive_takeaway_run_pattern, qb_fumble_pattern, touchdown_after_takeaway_pattern, handoff_pattern]
#     # Loop through patterns and find the first match
#     for pattern in rusher_patterns:
#       rusher = re.findall(pattern, play)
#       if len(rusher) > 0:
#         rusher_name = rusher[0]
#         df_plays.loc[idx, 'Rusher'] = rusher_name
#         break

#     # Direction
#     rushing_directions = ['guard', 'middle', 'tackle', 'end', 'kneels']
#     for i in rushing_directions:
#       if play.find(i) != -1:
#         start = play.find(rusher_name) + len(rusher_name) + 1
#         end = play.find(i) + len(i)
#         df_plays.loc[idx, 'Direction'] = play[start:end]
#         break

#     # Yardage gained
#     yardage = re.findall(yardage_gained, play)
#     if len(yardage) > 0:
#       df_plays.loc[idx, 'Yardage'] = int(yardage[0])
#     else:
#       df_plays.loc[idx, 'Yardage'] = 0

#     #############
#     #  DEFENSE  #
#     #############

#     solo_tackle = re.findall(solo_tackle_pattern, play)
#     if len(solo_tackle) > 0:
#       df_plays.loc[idx, 'SoloTackle'] = solo_tackle[0]

#     shared_tackle = re.findall(shared_tackle_pattern, play)
#     if len(shared_tackle) > 0:
#       df_plays.at[idx, 'SharedTackle'] = shared_tackle[0]

#     assisted_tackle = re.findall(assisted_tackle_pattern, play)
#     if len(assisted_tackle) > 0:
#       df_plays.loc[idx, 'SoloTackle'] = assisted_tackle[0][0]
#       df_plays.loc[idx, 'AssistedTackle'] = assisted_tackle[0][1]

#     ##############
#     #  INJURIES  #
#     ##############

#     injuries = re.findall(injury_pattern, play)
#     if len(injuries) > 0:
#       df_plays.at[idx, 'InjuredPlayers'] = injuries

#     #############
#     #  PENALTY  #
#     #############

#     # Accepted Penalty
#     if play.find('PENALTY') != -1:
#       play_elements = play.split(". ")
#       penalties = []
#       for i in play_elements:
#         if i.find('PENALTY') != -1:
#           penalties.append(i)
#       df_plays.at[idx, 'AcceptedPenalty'] = penalties

#     # Declined Penalty
#     if play.find('Penalty') != -1:
#       play_elements = play.split(". ")
#       penalties = []
#       for i in play_elements:
#         if i.find('Penalty') != -1:
#           penalties.append(i)
#       df_plays.at[idx, 'DeclinedPenalty'] = penalties

#     # Return if the last play has been cleaned in 'df_run_plays'
#     if df_run_plays.tail(1).index.tolist()[0] == idx:
#       return df_plays

In [24]:
# PURPOSE:
# - Clean run play types
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful run play
#                        data accessable and clean.

# NOTE:
# - Need to comment on how this is also a method being used for
#   1. fumble recoveries for yardage
#   2. fumble recoveries for touchdown
# - I also have not come across a case where a rushing play has been fumbled and someone
#   recovered the ball and scored a touchdown yet.

def clean_run_plays(df_plays, index_start = None):

  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_run_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Run')]
  else:
    df_run_plays = df_plays[df_plays['PlayOutcome'].str.contains('Run')]

  # Iterating through every run play within 'df_run_plays'
  for idx, play in df_run_plays['PlayDescription'].items():

    ################
    # Play details #
    ################

    # Play Type
    df_plays.loc[idx, 'PlayType'] = 'Run'

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

    # Formation
    Formation = re.findall(formation, play)
    if len(Formation) > 0:
      if Formation[0] == 'Aborted':
        pass
      else:
        df_plays.loc[idx, 'Formation'] = Formation[0]

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ###########
    # FUMBLES #
    ###########

    if play.find('FUMBLES') != -1:

      # - I think it would help to comment on each action added
      # - Does this catch fumble recovery touchdowns? <--
      main_action_patterns = [rusher_pattern, aborted_fumble_pattern, qb_fumble_pattern, defensive_takeaway_run_pattern, handoff_pattern]
      main_cleaning_method = clean_run_plays
      df_replacement_rows = extract_fumble_data(df_plays, play, idx,
                                                main_action_patterns,
                                                main_cleaning_method)

      # "df_plays.index.tolist().index(idx)" needed for method usage with slices of original dataframe.
      df_before = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
      index_of_last_added_row = idx + len(df_replacement_rows) - 1

      # returning row after the last index
      if df_run_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_run_plays(df_plays, index_of_last_added_row + 1)

    #############
    #  OFFENSE  #
    #############

    # Rusher
    rusher_patterns = [rusher_pattern, defensive_takeaway_run_pattern, qb_fumble_pattern, touchdown_after_takeaway_pattern, handoff_pattern]
    # Loop through patterns and find the first match
    for pattern in rusher_patterns:
      rusher = re.findall(pattern, play)
      if len(rusher) > 0:
        rusher_name = rusher[0]
        df_plays.loc[idx, 'Rusher'] = rusher_name
        break

    # Direction
    rushing_directions = ['guard', 'middle', 'tackle', 'end', 'kneels']
    for i in rushing_directions:
      if play.find(i) != -1:
        start = play.find(rusher_name) + len(rusher_name) + 1
        end = play.find(i) + len(i)
        df_plays.loc[idx, 'Direction'] = play[start:end]
        break

    # Yardage gained
    yardage = re.findall(yardage_gained, play)
    if len(yardage) > 0:
      df_plays.loc[idx, 'Yardage'] = int(yardage[0])
    else:
      df_plays.loc[idx, 'Yardage'] = 0

    #############
    #  DEFENSE  #
    #############

    solo_tackle = re.findall(solo_tackle_pattern, play)
    if len(solo_tackle) > 0:
      df_plays.loc[idx, 'SoloTackle'] = solo_tackle[0]

    shared_tackle = re.findall(shared_tackle_pattern, play)
    if len(shared_tackle) > 0:
      df_plays.at[idx, 'SharedTackle'] = shared_tackle[0]

    assisted_tackle = re.findall(assisted_tackle_pattern, play)
    if len(assisted_tackle) > 0:
      df_plays.loc[idx, 'SoloTackle'] = assisted_tackle[0][0]
      df_plays.loc[idx, 'AssistedTackle'] = assisted_tackle[0][1]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury_pattern, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # I am going to create new features to add onto penalties
    # 1. Yardage from penalty
    #    - offensive penalty yards
    #    - defensive penalty yards
    # 2. Offensive penalty
    # 3. Defensive penalty
    # 4. Maybe I should add a 'total yards gained' feature?
    #    - This way, I can easily grab
    #      1. Yards from play
    #      2. Yards from penalty
    #      3. Yards gained all together
    # 5. Should I have a 'Player awarded yardage' feature?
    #
    # - I'm going to have to make a helper method to handle plays containing
    #   penalties.

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      # accepted_penalty_play(df_run_plays, play, idx)
      play_elements = play.split(". ")
      offensive_penalties = []
      defensive_penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:

          # Here I will see whether it was an offensive or defensive penalty
          penalty_player_team = re.findall(team_name_pattern, i)
          if len(penalty_player_team) > 0:
            if penalty_player_team[0] == df_run_plays['TeamWithPossession'].loc[idx]:
              accepted_penalty_play_on_offense(df_run_plays, play, idx)
              offensive_penalties.append(i)
            else:
              defensive_penalties.append(i)

      df_plays.at[idx, 'OffensivePenalty'] = offensive_penalties
      df_plays.at[idx, 'DefensivePenalty'] = defensive_penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      offensive_penalties = []
      defensive_penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          # Here I will see whether it was an offensive or defensive penalty
          penalty_player_team = re.findall(team_name_pattern, i)
          if len(penalty_player_team) > 0:
            if penalty_player_team[0] == df_run_plays['TeamWithPossession'].loc[idx]:
              offensive_penalties.append(i)
            else:
              defensive_penalties.append(i)

      df_plays.at[idx, 'OffensivePenalty'] = offensive_penalties
      df_plays.at[idx, 'DefensivePenalty'] = defensive_penalties

    # Return if the last play has been cleaned in 'df_run_plays'
    if df_run_plays.tail(1).index.tolist()[0] == idx:
      return df_plays

####2PT CONVERSIONS

In [25]:
# I NEED A LARGER SAMPLE SIZE FOR MORE PLAYS
# - I need a sample size that has fumbled plays (if that's possible?)
# - I need a sample size that has interception (if that's possible?)
# - I need a sample size with injuries (as dark as that may sound)

def cleaning_2pt_conversion_plays(df_plays, index_start = None):

  # Cut 'df_plays' to begin from 'index_start' to the last penalty play available in dataframe
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start]
    df_2pt_conversion_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('2PT Conversion', case=False)]
  else:
    df_2pt_conversion_plays = df_plays[df_plays['PlayOutcome'].str.contains('2PT Conversion', case=False)]

  # Iterating through every penalty play within 'df_2pt_conversion_plays'
  for idx, play in df_2pt_conversion_plays['PlayDescription'].items():

    ###################
    # PASSING ATTEMPT #
    ###################

    pass_2ptc = re.findall(tp_conversion_pass_pattern, play)
    if len(pass_2ptc) > 0:
      df_plays.loc[idx, 'Passer'] = pass_2ptc[0][0]
      df_plays.loc[idx, 'Receiver'] = pass_2ptc[0][1]
      df_plays.loc[idx, 'PlayType'] = '2PT Conversion Pass'

    ###################
    # RUSHING ATTEMPT #
    ###################

    rush_2ptc = re.findall(tp_conversion_rush_pattern, play)
    if len(rush_2ptc) > 0:
      df_plays.loc[idx, 'Rusher'] = rush_2ptc[0]
      df_plays.loc[idx, 'PlayType'] = '2PT Conversion Run'
      # Direction
      rushing_directions = ['guard', 'middle', 'tackle', 'end', 'kneels']
      for i in rushing_directions:
        if play.find(i) != -1:
          start = play.find('rushes') + len('rushes') + 1
          end = play.find(i) + len(i)
          df_plays.loc[idx, 'Direction'] = play[start:end]
          break

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:
          penalties.append(i)
      df_plays.at[idx, 'AcceptedPenalty'] = penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          penalties.append(i)
      df_plays.at[idx, 'DeclinedPenalty'] = penalties

  return df_plays

###DEFENSE CLEANING METHODS

#### INTERCEPTIONS

In [26]:
# PURPOSE:
# - Clean intercepted plays
# INPUT PARAMETERS:
# df_plays    - dataframe - dataframe of plays
# index_start -  integer  - the starting index of the associated input dataframe
#                           to begin cleaning.
# RETURN:
# df_plays - dataframe - dataframe of plays that now has all useful intercepted play
#                        data accessible and clean.

# ROUGH DESGIN
# 1. Narrow dataframe using 'index_start'
#    - This is a recursive method, the narrowing will get smaller and
#      smaller until all 'intercepted' type plays have been cleaned.
# 2. Grab first 'intercepted' play from narrowed dataframe
# 3. Create 2 single row dataframes.
#    a. intended play
#    b. yardage after interception
# 4. Break down play into sentences and clean
#    - Depending on the sentence within the play, will determine which
#      single row dataframe it will go to.
# 5. Combine both dataframes of cleaned data into one dataframe
# 6. Replace old play row with new cleaned multi row
# 7. return clean_interceped_plays( x , y)
#    - x = updated df_plays
#    - y = index directly after the last clean added row

# Concerns:
# ~ 1 ~
# PLAY SNIP - "(9:53) (Shotgun) D.Watson pass short left intended for E.Moore INTERCEPTED by D.Hill (Z.Carter) at CIN 30."
# - The concern here is (Z.Carter)
#   - I do not know what to categorize this player as? I believe that he had an impact on the play and could possibly be a reason
#     that D.Hill was able to intercept the ball.
#     - Should I create a feature called "ImpactPlayer" or something?
# ~ 2 ~
# PLAY SNIP - "(4:16) (Shotgun) J.Allen pass deep middle intended for S.Diggs INTERCEPTED by J.Whitehead [Q.Williams] at NYJ -1. Touchback."
# - The concern here is 'touchback'
#   - I have no idea what to do with that
# ~ 3 ~
#`- I do not have anything set in play to handle fumbles? What happens if a QB fumbles, recovers, then throws an interception? -> Then player that intercepted fumbles?
# ~ 4 ~
# - There are 2 rows within this sinlge play. (Intended throwing play, yardage after interception)
#   - For both of these rows that represent a single play, they both state that the throwing team has possession
#     - I do not know how this is going to effect the future with analysis on data
# - -----> GRAB DATA FOR TOUCHBACKS <-----
# - -----> GRAB DATA FOR PLAYTYPE INTERCEPTION FOR YARDAGE <-----

def clean_intercepted_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_intercepted_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Interception')]
  else:
    df_intercepted_plays = df_plays[df_plays['PlayOutcome'].str.contains('Interception')]

  # Exit case (If no more 'Interception' type plays are found)
  if df_intercepted_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first intercepted play in 'df_intercepted_plays'
  # - Process one play per iteration in the recursive method
  idx = df_intercepted_plays.index[0]
  play = df_plays['PlayDescription'].loc[idx]

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play.find('REVERSED') != -1:
    play_elements = play.split(". ")
    for i in play_elements:
      if i.find("REVERSED") != -1:
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  # Create 2 single row dataframes.
  # 1. intended play
  df_intended_play = df_plays.loc[idx].copy()
  df_intended_play = pd.DataFrame([df_intended_play], columns=df_plays.columns)
  df_intended_play.reset_index(drop=True, inplace=True)
  df_intended_play['PlayDescription'] = 'nan'
  # 2. yardage after interception
  df_yardage_after_interception = df_plays.loc[idx].copy()
  df_yardage_after_interception = pd.DataFrame([df_yardage_after_interception], columns=df_plays.columns)
  df_yardage_after_interception.reset_index(drop=True, inplace=True)
  df_yardage_after_interception['PlayDescription'] = 'nan'

  # break down play by sentences.
  play_elements = play.split(". ")

  # Every sentence within 'PlayDescription' except yardage/touchdown after interception
  intended_play_data = []

  # iterate through play_elements
  for i in play_elements:

    ##############################
    # YARDAGE AFTER INTERCEPTION #
    ##############################

    yardage_after_interception = re.findall(defensive_takeaway_run_pattern, i)
    if len(yardage_after_interception) > 0:
      df_yardage_after_interception['PlayDescription'] = i

      # Player running after interception
      df_yardage_after_interception['Rusher'] = yardage_after_interception[0]

      # Playtype?
      # - Should this be a new playtype? Something like "RunAfterInterception"?

      # Yardage gained
      yardage = re.findall(yardage_gained, i)
      if len(yardage) > 0:
        df_yardage_after_interception['Yardage'] = int(yardage[0])
      else:
        df_yardage_after_interception['Yardage'] = 0

      # Who made tackle
      tackler = re.findall(solo_tackle_pattern, i)
      if len(tackler) > 0:
        df_yardage_after_interception['SoloTackle'] = tackler[0]

      continue

    ################################
    # TOUCHDOWN AFTER INTERCEPTION #
    ################################

    touchdown_after_interception_check = re.findall(touchdown_after_takeaway_pattern, i)
    if len(touchdown_after_interception_check) > 0:
      df_yardage_after_interception['PlayDescription'] = i

      # Player running after interception
      df_yardage_after_interception['Rusher'] = touchdown_after_interception_check[0]

      # Yardage gained
      yardage = re.findall(yardage_gained, i)
      if len(yardage) > 0:
        df_yardage_after_interception['Yardage'] = int(yardage[0])

      # # PlayOutcome
      # df_yardage_after_interception['PlayOutcome'] = 'Touchdown'

      # IsScoringPlay
      df_yardage_after_interception['IsScoringPlay'] = 1

      continue

    intended_play_data.append(i)

  #################
  # INTENDED PLAY #
  #################

  intended_play_playdescription = ". ".join(intended_play_data)

  df_intended_play['PlayDescription'] = intended_play_playdescription

  df_intended_play['PlayOutcome'] = 'Pass'
  df_intended_play = clean_pass_plays(df_intended_play)
  df_intended_play['PlayOutcome'] =  df_plays['PlayOutcome'].loc[idx]

  # Intercepted by
  intercepted_by = re.findall(interception_name_pattern, intended_play_playdescription)
  if len(intercepted_by) > 0:
    df_intended_play['InterceptedBy'] = intercepted_by[0]

  #############################
  # NEW REPLACEMENT DATAFRAME #
  #############################

  # combine both single row dataframes into one
  if df_yardage_after_interception['PlayDescription'].iloc[0] == 'nan':
    df_cleaned_replacement = df_intended_play
  else:
    df_cleaned_replacement = pd.concat([df_intended_play, df_yardage_after_interception], ignore_index=True)

  # Replace old row with new cleaned dataframe
  df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
  df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
  df_plays = pd.concat([df_before_row, df_cleaned_replacement, df_after_row], ignore_index=True)

  # If this is the last play in the dataset
  if df_intercepted_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_intercepted_plays(df_plays, idx+len(df_cleaned_replacement))

#### SACKS


In [27]:
def clean_sacked_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start != None:
    df_plays_adjusted = df_plays.iloc[index_start:]
    df_sacked_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Sack')]
  else:
    df_sacked_plays = df_plays[df_plays['PlayOutcome'].str.contains('Sack')]

  for idx, play in df_sacked_plays['PlayDescription'].items():

    ################
    # Play details #
    ################

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

    ###########
    # FUMBLES #
    ###########

    if play.find('FUMBLES') != -1:

      main_action_patterns = [passer_name_pattern, defensive_takeaway_run_pattern, touchdown_after_takeaway_pattern]
      main_cleaning_method = clean_sacked_plays
      df_replacement_rows = extract_fumble_data(df_plays, play, idx,
                                                main_action_patterns,
                                                main_cleaning_method)

      # "df_plays.index.tolist().index(idx)" needed for method usage with slices of original dataframe.
      df_before = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
      index_of_last_added_row = idx + len(df_replacement_rows) - 1
      # returning row after the last index
      if df_sacked_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_sacked_plays(df_plays, index_of_last_added_row + 1)

    #############
    #  OFFENSE  #
    #############

    # Formation
    Formation = re.findall(formation, play)
    if len(Formation) > 0:
      if Formation[0] == 'Aborted':
        pass
      else:
        df_plays.loc[idx, 'Formation'] = Formation[0]

    # Sacked Passer
    sacked_passer_name = re.findall(passer_name_pattern, play)
    if len(sacked_passer_name) > 0:
      df_plays.loc[idx, 'Passer'] = sacked_passer_name[0]

    # Yardage lost
    yardage = re.findall(yardage_from_sack, play)
    if len(yardage) > 0:
      df_plays.loc[idx, 'Yardage'] = int(yardage[0])

    #############
    #  DEFENSE  #
    #############

    # Solo sack (One person sacked the passer)
    solo_sack = re.findall(defense_tackler_1_name_pattern, play)
    if len(solo_sack) > 0:
      df_plays.loc[idx, 'SackedBy'] = solo_sack[0]

    # Split sack (A sack was given to the passer by multiple defenders)
    split_sack = re.findall(split_sack_pattern, play)
    if len(split_sack) > 0:
      df_plays.at[idx, 'SackedBy'] = split_sack[0]

    ##############
    #  INJURIES  #
    ##############

    injuries = re.findall(injury_pattern, play)
    if len(injuries) > 0:
      df_plays.at[idx, 'InjuredPlayers'] = injuries

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if play.find('PENALTY') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('PENALTY') != -1:
          penalties.append(i)
      df_plays.at[idx, 'AcceptedPenalty'] = penalties

    # Declined Penalty
    if play.find('Penalty') != -1:
      play_elements = play.split(". ")
      penalties = []
      for i in play_elements:
        if i.find('Penalty') != -1:
          penalties.append(i)
      df_plays.at[idx, 'DeclinedPenalty'] = penalties

    if df_sacked_plays.tail(1).index.tolist()[0] == idx:
      return df_plays

### SPECIAL TEAMS CLEANING METHODS

#### PUNTS

In [28]:
# A punt playtype will be split into 2 or more rows
#   1. The Punt
#      - 'PlayType'
#         - Punt
#      - 'Punter'
#      - 'LongSnapper'
#   2. The Punt Return
#      - 'PlayType'
#         - Punt Return
#      - 'PlayOutcome'
#         - x yard punt return
#         - fair catch
#         - touchback
#         - out of bounds
#         - downed
#      - 'Returner'
#      - 'Receiver'
#      - 'Yardage'
#      - 'TackleBy1'
#      - 'TackleBy2'
#      - 'DownedBy'

# I need to figure out a fake punt
# I need to figure out a punt that has been blocked
# I need to figure out what to do when a fumble happens
# I need to figure out what to do when a touchdown happens
# Maybe in the future, to make this more space friendly, I can combine features
# - Such as 'Punter' & 'LongSnapper' OR 'TackleBy1' & 'DownedBy'
#   OR 'Returner' & 'Receiver'

def clean_punt_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_punt_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Punt')]
  else:
    df_punt_plays = df_plays[df_plays['PlayOutcome'].str.contains('Punt')]

  if df_punt_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first punt play in 'df_punt_plays'
  # - Process one play per iteration in the recursive method
  idx = df_punt_plays.index[0]
  play = df_plays['PlayDescription'].loc[idx]
  row_copy = df_plays.loc[idx].copy()

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play.find('REVERSED') != -1:
    play_elements = play.split(". ")
    for i in play_elements:
      if i.find("REVERSED") != -1:
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  # Create 2 single row dataframes.
  # 1. The Punt
  df_punt = row_copy
  df_punt = pd.DataFrame([df_punt], columns=df_plays.columns)
  df_punt.reset_index(drop=True, inplace=True)
  df_punt['PlayDescription'] = 'nan'
  # 2. The Punt Return
  df_punt_return = row_copy
  df_punt_return = pd.DataFrame([df_punt_return], columns=df_plays.columns)
  df_punt_return.reset_index(drop=True, inplace=True)
  df_punt_return['PlayDescription'] = 'nan'

  #############
  # PLAY TIME #
  #############

  time = re.findall(time_on_clock_pattern, play)
  if len(time) > 0:
    df_punt.loc[0, 'TimeOnTheClock'] = time[0]

  # break down play by sentences.
  play_elements = play.split(". ")

  accepted_penalties = []
  declined_penalties = []

  for i in play_elements:

    ########
    # PUNT #
    ########

    # All data needed for first row in replacement dataframe
    punt = re.findall(punting_pattern, i)
    if len(punt) > 0:
      df_punt['PlayType'] = 'Punt'
      df_punt['PlayDescription'] = i
      df_punt['Kicker'] = punt[0][0]
      df_punt['Yardage'] = int(punt[0][1])
      df_punt['LongSnapper'] = punt[0][2]
      # Touchback
      if i.find('Touchback') != -1:
        df_punt['PlayOutcome'] = 'Touchback'
        continue
      # Out of bounds
      if i.find('out of bounds') != -1:
        df_punt['PlayOutcome'] = 'out of bounds'
        continue
      # Downed by
      if i.find('downed by') != -1:
        df_punt['PlayOutcome'] = 'downed'
        downed_by = re.findall(kick_downed_by_pattern, i)
        df_punt['DownedBy'] = downed_by[0][downed_by[0].find("-")+1:] # Need to get abreviation of team name away from player name (e.g. IND-G.Stuard)
        continue
      # fair catch
      if i.find('fair catch') != -1:
        df_punt['PlayOutcome'] = 'fair catch'
        fair_catch_by = re.findall(punt_fair_catch_pattern, i)
        df_punt['Returner'] = fair_catch_by[0]
        continue
      continue

    ######################################
    # PUNT RETURN (Including touchdowns) #
    ######################################

    # All data needed for the second row within replacement dataframe
    # - Second row only needed when there is a punt return for yardage
    # - I think I am going to run into trouble if there is a fumble recovery for yardage
    punt_return_patterns = [punt_return_pattern, touchdown_after_takeaway_pattern]
    for return_pattern in punt_return_patterns:
      punt_return = re.findall(return_pattern, i)
      if len(punt_return) > 0:
        df_punt_return['PlayDescription'] = i
        df_punt_return['PlayOutcome'] = 'Run'
        df_punt_return = clean_run_plays(df_punt_return)
        df_punt_return['PlayOutcome'] = row_copy['PlayOutcome']
        df_punt_return['PlayType'] = 'Punt Return'
        df_punt_return['Rusher'] = 'nan'
        df_punt_return['Returner'] = punt_return[0]
        break

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if i.find('PENALTY') != -1:
      accepted_penalties.append(i)

    # Declined Penalty
    if i.find('Penalty') != -1:
      declined_penalties.append(i)

    # If playoutcome is the same as the original play, then run second sentence through
    # run cleaning method.

  if len(accepted_penalties) > 0:
    df_punt.at[0, 'AcceptedPenalty'] = accepted_penalties
  if len(declined_penalties) > 0:
    df_punt.at[0, 'DeclinedPenalty'] = declined_penalties

  #############################
  # NEW REPLACEMENT DATAFRAME #
  #############################

  if df_punt_return['PlayDescription'].iloc[0] == 'nan':
    df_replacement_rows = df_punt
  else:
    df_replacement_rows = pd.concat([df_punt, df_punt_return], ignore_index=True)

  df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
  df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
  df_plays = pd.concat([df_before_row, df_replacement_rows, df_after_row], ignore_index=True)

  if df_punt_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_punt_plays(df_plays, idx+len(df_replacement_rows))

#### KICKOFFS

In [29]:
# A kickoff playtype will be split into 1 or more rows

# I need to figure out an onside kick (recovered by kicking team)
# I need to figure out fumbled kickoff returns
# I need to figure out returns for a touchdown
# injuries?

# Method can mirror punts method.

def clean_kickoff_plays(df_plays, index_start = None):

  # Will cut df_plays starting from index_start (narrowing our search space)
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_kickoff_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('kickoff', case=False)]
  else:
    df_kickoff_plays = df_plays[df_plays['PlayOutcome'].str.contains('kickoff', case=False)]

  # exit case
  if df_kickoff_plays.empty:
    return df_plays

  # Retrieve the index and 'PlayDescription' of the first kickoff play in 'df_kickoff_plays'
  # - Process one play per iteration in the recursive method
  idx = df_kickoff_plays.index[0]
  play = df_plays['PlayDescription'].loc[idx]

  ############
  # REVERSES #
  ############

  # In 'PlayDescription' all information before the "reversed" sentence is not needed.
  # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
  if play.find('REVERSED') != -1:
    play_elements = play.split(". ")
    for i in play_elements:
      if i.find("REVERSED") != -1:
        df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
        play = ". ".join(play_elements[play_elements.index(i) + 1:])
        break

  ###########
  # FUMBLES #
  ###########

  if play.find('FUMBLES') != -1:
    main_action_patterns = [kickoff_pattern, kick_return_pattern, defensive_takeaway_run_pattern, handoff_pattern]
    main_cleaning_method = clean_kickoff_plays
    df_replacement_rows = extract_fumble_data(df_plays, play, idx,
                                              main_action_patterns,
                                              main_cleaning_method)

    df_before = df_plays.iloc[:df_plays.index.tolist().index(idx)]
    df_after = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
    df_plays = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
    index_of_last_added_row = idx + len(df_replacement_rows) - 1

    # returning row after the last index
    if df_kickoff_plays.tail(1).index.tolist()[0] == idx:
      return df_plays
    else:
      return clean_run_plays(df_plays, index_of_last_added_row + 1)

  # Create 2 single row dataframes.
  # 1. The Kickoff
  df_kickoff = df_plays.loc[idx].copy()
  df_kickoff = pd.DataFrame([df_kickoff], columns=df_plays.columns)
  df_kickoff.reset_index(drop=True, inplace=True)
  df_kickoff['PlayDescription'] = 'nan'
  # 2. The Kickoff Return
  df_kickoff_return = df_plays.loc[idx].copy()
  df_kickoff_return = pd.DataFrame([df_kickoff_return], columns=df_plays.columns)
  df_kickoff_return.reset_index(drop=True, inplace=True)
  df_kickoff_return['PlayDescription'] = 'nan'

  # break down play by sentences.
  play_elements = play.split(". ")

  accepted_penalties = []
  declined_penalties = []

  for i in play_elements:

    ###########
    # KICKOFF #
    ###########

    kickoff = re.findall(kickoff_pattern, i)
    if len(kickoff) > 0:
      df_kickoff['PlayType'] = 'Kickoff'
      df_kickoff['PlayDescription'] = i
      df_kickoff['Kicker'] = kickoff[0][0]
      df_kickoff['Yardage'] = int(kickoff[0][1])
      if i.find('Touchback') != -1:
        df_kickoff['PlayOutcome'] = 'Touchback'
        continue
      # I need to figure out what the difference will be when the kicking team recovers
      if i.find('onside') != -1:
        df_kickoff['PlayOutcome'] = 'onside'
        downed_by = re.findall(kick_downed_by_pattern, i)
        if len(downed_by) > 0:
          df_kickoff['DownedBy'] = downed_by[0][downed_by[0].find("-")+1:]
        continue
      continue

    #########################################
    # KICKOFF RETURN (Including touchdowns) #
    #########################################

    kick_return_patterns = [kick_return_pattern, touchdown_after_takeaway_pattern]
    for return_pattern in kick_return_patterns:
      kick_return = re.findall(return_pattern, i)
      if len(kick_return) > 0:
        df_kickoff_return['PlayDescription'] = i
        df_kickoff_return['PlayOutcome'] = 'Run'
        df_kickoff_return = clean_run_plays(df_kickoff_return)
        df_kickoff_return['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]
        df_kickoff_return['PlayType'] = 'Kickoff Return'
        df_kickoff_return['Rusher'] = 'nan'
        df_kickoff_return['Returner'] = kick_return[0][0] # I think this will be a problem once I get a dataset with kick return touchdowns
        break

    #############
    #  PENALTY  #
    #############

    # Accepted Penalty
    if i.find('PENALTY') != -1:
      accepted_penalties.append(i)

    # Declined Penalty
    if i.find('Penalty') != -1:
      declined_penalties.append(i)

    # If playoutcome is the same as the original play, then run second sentence through
    # run cleaning method.

  if len(accepted_penalties) > 0:
    df_kickoff.at[0, 'AcceptedPenalty'] = accepted_penalties
  if len(declined_penalties) > 0:
    df_kickoff.at[0, 'DeclinedPenalty'] = declined_penalties

  #############################
  # NEW REPLACEMENT DATAFRAME #
  #############################

  if df_kickoff_return['PlayDescription'].iloc[0] == 'nan':
    df_replacement_rows = df_kickoff
  else:
    df_replacement_rows = pd.concat([df_kickoff, df_kickoff_return], ignore_index=True)

  df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
  df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
  df_plays = pd.concat([df_before_row, df_replacement_rows, df_after_row], ignore_index=True)

  if df_kickoff_plays.tail(1).index.tolist()[0] == idx:
    return df_plays
  else:
    return clean_kickoff_plays(df_plays, idx+len(df_replacement_rows))

###SCORING CLEANING METHODS

#### TOUCHDOWNS

In [30]:
# Still need to figure out whether or not plays that have multiple rows will all have
# 'IsScoringDrive' = 1, 'IsScoringDrive' = 1, 'PlayOutcome' = *teamname* Touchdown
# - The reasoning to not have this is because if a qb was to throw a pick 6,
#   it wouldn't count as a "Scoring Drive" for them but the opposing team.
# - For consistency, I will have the entire play have
#   'IsScoringDrive' = 1, 'IsScoringDrive' = 1, 'PlayOutcome' = *teamname* Touchdown
# - Need larger dataset to include all other touchdown plays such as kickoff returns and field goal returns

def clean_touchdown_plays(df_plays, index_start=None):

  # Cut 'df_plays' to begin from 'index_start' to the last touchdown play available in dataframe
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_touchdown_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Touchdown')]
  else:
    df_touchdown_plays = df_plays[df_plays['PlayOutcome'].str.contains('Touchdown')]

  # Iterating through every touchdown play within 'df_touchdown_plays'
  for idx, play in df_touchdown_plays['PlayDescription'].items():

    # - Once i figure out what kind of touchdown it was, then I will be able to
    #   determine the 'PlayType'

    ######################
    # PASSING TOUCHDOWNS #
    ######################

    # If a play has a passer throwing the ball, I am assuming it is a passing play
    passing_play = re.findall(passer_name_pattern, play)
    if len(passing_play) > 0 and play.find("sacked") == -1 and play.find("INTERCEPTED") == -1:

      # creating a copy of the passing touchdown play row and cleaning the copy
      passing_touchdown_row = df_plays.loc[idx].copy()
      passing_touchdown_row['PlayType'] = 'Pass'
      passing_touchdown_row['PlayOutcome'] = 'Pass'
      passing_touchdown_row['IsScoringPlay'] = 1
      passing_touchdown_row = pd.DataFrame([passing_touchdown_row], columns=df_plays.columns)
      passing_touchdown_row = clean_pass_plays(passing_touchdown_row)
      passing_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, passing_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(passing_touchdown_row))

    ######################
    # RUSHING TOUCHDOWNS #
    ######################

    # Rusher
    rusher_patterns = [rusher_pattern, defensive_takeaway_run_pattern]
    # Loop through patterns and find the first match
    for pattern in rusher_patterns:
      rusher = re.findall(pattern, play)
      if len(rusher) > 0:
        # creating a copy of the rushing touchdown play row and cleaning the copy
        rushing_touchdown_row = df_plays.loc[idx].copy()
        rushing_touchdown_row['PlayType'] = 'Run'
        rushing_touchdown_row['PlayOutcome'] = 'Run'
        rushing_touchdown_row['IsScoringPlay'] = 1
        rushing_touchdown_row = pd.DataFrame([rushing_touchdown_row], columns=df_plays.columns)
        rushing_touchdown_row = clean_run_plays(rushing_touchdown_row)
        rushing_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

        # Replacing old row with cleaned row
        df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
        df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
        df_plays = pd.concat([df_before_row, rushing_touchdown_row, df_after_row], ignore_index=True)

        # Recursion to update 'df_plays'
        if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
          return df_plays
        else:
          return clean_touchdown_plays(df_plays, idx+len(rushing_touchdown_row))

    ##########################
    # INTERCEPTED TOUCHDOWNS #
    ##########################

    # Still need to clean intercepted play types
    if play.find("INTERCEPTED") != -1:

      # creating a copy of the incercepted touchdown play and cleaning the copy
      intercepted_touchdown_row = df_plays.loc[idx].copy()
      intercepted_touchdown_row['PlayOutcome'] = 'Interception'
      intercepted_touchdown_row['IsScoringPlay'] = 1 # This will only be the value for the team that threw the interception
      intercepted_touchdown_row = pd.DataFrame([intercepted_touchdown_row], columns=df_plays.columns)
      intercepted_touchdown_row.reset_index(drop=True, inplace=True)
      intercepted_touchdown_row = clean_intercepted_plays(intercepted_touchdown_row)
      intercepted_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      #################################################################################################### Under Construction
      # Change feature 'TeamWithPossession' for each play in drive
      # - Raw data states that the team that intercepted the ball for a touchdown had possession for each play
      #   within drive. The correct value for this feature for each play in drive is the team that threw
      #   the interception.

      wrong_team_with_possession = df_plays['TeamWithPossession'].loc[idx]
      if wrong_team_with_possession == dict_teams.get(df_plays['HomeTeam'].loc[idx]):
        correct_team_with_possession = dict_teams.get(df_plays['AwayTeam'].loc[idx])
      else:
        correct_team_with_possession = dict_teams.get(df_plays['HomeTeam'].loc[idx])

      # HERE I NEED TO CHANGE ALL 'TEAMWITHPOSSESSION' FEATURES FOR EVERY PLAY IN DRIVE
      # I need to figure out how to efficiently grab every play in drive.
      intercepted_touchdown_row['TeamWithPossession'] = correct_team_with_possession
      conditions_for_unique_drive = ((df_plays['Season'] == df_plays['Season'].loc[idx]) &
      (df_plays['Week'] == df_plays['Week'].loc[idx]) &
      (df_plays['AwayTeam'] == df_plays['AwayTeam'].loc[idx]) &
      (df_plays['HomeTeam'] == df_plays['HomeTeam'].loc[idx]) &
      (df_plays['Quarter'] == df_plays['Quarter'].loc[idx]) &
      (df_plays['DriveNumber'] == df_plays['DriveNumber'].loc[idx]))

      df_plays.loc[conditions_for_unique_drive, 'TeamWithPossession'] = correct_team_with_possession

      ####################################################################################################

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, intercepted_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(intercepted_touchdown_row))

    #####################################
    # SACKED FUMBLE RECOVERY TOUCHDOWNS #
    #####################################

    if play.find("sacked") != -1:
      # print(idx)

      # creating a copy of the sack touchdown play and cleaning the copy
      sacked_touchdown_row = df_plays.loc[idx].copy()
      sacked_touchdown_row['PlayOutcome'] = 'Sack'
      sacked_touchdown_row['IsScoringPlay'] = 1
      sacked_touchdown_row = pd.DataFrame([sacked_touchdown_row], columns=df_plays.columns)
      sacked_touchdown_row.reset_index(drop=True, inplace=True)
      sacked_touchdown_row = clean_sacked_plays(sacked_touchdown_row)
      sacked_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      #################################################################################################### Under Construction
      # Change feature 'TeamWithPossession' for each play in drive
      # - Raw data states that the team that recovered the ball for a touchdown had possession for each play
      #   within drive. The correct value for this feature for each play in drive is the team that fumbled.

      wrong_team_with_possession = df_plays['TeamWithPossession'].loc[idx]
      if wrong_team_with_possession == dict_teams.get(df_plays['HomeTeam'].loc[idx]):
        correct_team_with_possession = dict_teams.get(df_plays['AwayTeam'].loc[idx])
      else:
        correct_team_with_possession = dict_teams.get(df_plays['HomeTeam'].loc[idx])

      # HERE I NEED TO CHANGE ALL 'TEAMWITHPOSSESSION' FEATURES FOR EVERY PLAY IN DRIVE
      # I need to figure out how to efficiently grab every play in drive.
      sacked_touchdown_row['TeamWithPossession'] = correct_team_with_possession
      conditions_for_unique_drive = ((df_plays['Season'] == df_plays['Season'].loc[idx]) &
      (df_plays['Week'] == df_plays['Week'].loc[idx]) &
      (df_plays['AwayTeam'] == df_plays['AwayTeam'].loc[idx]) &
      (df_plays['HomeTeam'] == df_plays['HomeTeam'].loc[idx]) &
      (df_plays['Quarter'] == df_plays['Quarter'].loc[idx]) &
      (df_plays['DriveNumber'] == df_plays['DriveNumber'].loc[idx]))

      df_plays.loc[conditions_for_unique_drive, 'TeamWithPossession'] = correct_team_with_possession

      ####################################################################################################

      # Replacing old row with cleaned row (Original row can sometimes be replaced with multiple rows)
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, sacked_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(sacked_touchdown_row))

    ##########################
    # PUNT RETURN TOUCHDOWNS #
    ##########################

    punt_play = re.findall(punting_pattern, play)
    if len(punt_play) > 0:

      # creating a copy of the punt touchdown play and cleaning the copy
      punt_touchdown_row = df_plays.loc[idx].copy()
      punt_touchdown_row['PlayOutcome'] = 'Punt'
      punt_touchdown_row['IsScoringPlay'] = 1 # This will only be the value for the team that punted the ball
      punt_touchdown_row = pd.DataFrame([punt_touchdown_row], columns=df_plays.columns)
      punt_touchdown_row.reset_index(drop=True, inplace=True)
      punt_touchdown_row = clean_punt_plays(punt_touchdown_row)
      punt_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, punt_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(punt_touchdown_row))

    #################################
    # BLOCKED FIELD GOAL TOUCHDOWNS #
    #################################

    field_goal_blocked = re.findall(field_goal_blocked_pattern, play)
    if len(field_goal_blocked) > 0:
      print(idx)

      # creating a copy of recovered blocked field goal touchdown play and cleaning the copy
      blocked_fg_touchdown_row = df_plays.loc[idx].copy()
      blocked_fg_touchdown_row['PlayOutcome'] = 'Field Goal'
      blocked_fg_touchdown_row['IsScoringPlay'] = 1 # This will only be the value for the team that attempted the field goal
      blocked_fg_touchdown_row = pd.DataFrame([blocked_fg_touchdown_row], columns=df_plays.columns)
      blocked_fg_touchdown_row.reset_index(drop=True, inplace=True)
      blocked_fg_touchdown_row = clean_field_goal_plays(blocked_fg_touchdown_row)
      blocked_fg_touchdown_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      #################################################################################################### Under Construction
      # Change feature 'TeamWithPossession' for each play in drive
      # - Raw data states that the team that blocked the field goal attempt and recovered for a touchdown had possession for each play
      #   within drive. The correct value for this feature for each play in drive is the team that threw
      #   the interception.

      wrong_team_with_possession = df_plays['TeamWithPossession'].loc[idx]
      if wrong_team_with_possession == dict_teams.get(df_plays['HomeTeam'].loc[idx]):
        correct_team_with_possession = dict_teams.get(df_plays['AwayTeam'].loc[idx])
      else:
        correct_team_with_possession = dict_teams.get(df_plays['HomeTeam'].loc[idx])

      # HERE I NEED TO CHANGE ALL 'TEAMWITHPOSSESSION' FEATURES FOR EVERY PLAY IN DRIVE
      # I need to figure out how to efficiently grab every play in drive.
      blocked_fg_touchdown_row['TeamWithPossession'] = correct_team_with_possession
      conditions_for_unique_drive = ((df_plays['Season'] == df_plays['Season'].loc[idx]) &
      (df_plays['Week'] == df_plays['Week'].loc[idx]) &
      (df_plays['AwayTeam'] == df_plays['AwayTeam'].loc[idx]) &
      (df_plays['HomeTeam'] == df_plays['HomeTeam'].loc[idx]) &
      (df_plays['Quarter'] == df_plays['Quarter'].loc[idx]) &
      (df_plays['DriveNumber'] == df_plays['DriveNumber'].loc[idx]))

      df_plays.loc[conditions_for_unique_drive, 'TeamWithPossession'] = correct_team_with_possession

      ####################################################################################################

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, blocked_fg_touchdown_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_touchdown_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_touchdown_plays(df_plays, idx+len(blocked_fg_touchdown_row))

#### FIELD GOALS

In [70]:
# I need an example of when a player returns the field goal for yardage
# I need a larger sample size for "Blocked" field goals
# I need to figure out what to do if someone fumbles a recovery
# I need to figure out what to do on a trick play (e.i. holder runs out with the ball)
# - INCOMPLETE. NEED LARGER SAMPLE SIZE

def clean_field_goal_plays(df_plays, index_start = None):

  # Adjusting df_plays to start cleaning at a specified index (index_start)
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    # Locating all field goal plays within dataframe
    df_field_goal_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Field Goal')]
  else:
    # Locating all field goal plays within dataframe
    df_field_goal_plays = df_plays[df_plays['PlayOutcome'].str.contains('Field Goal')]

  for idx, play in df_field_goal_plays['PlayDescription'].items():

    play_elements = play.split(". ")

    ###################
    # EXTRA PLAY DATA #
    ###################

    # I may have the change this later.
    # I think I will have to move this towards the end.

    if len(play_elements) > 1:

      accepted_penalties = []
      declined_penalties = []
      injured_players = []

      for i in play_elements:

        # Accepted Penalty
        if i.find('PENALTY') != -1:
          accepted_penalties.append(i)

        # Declined Penalty
        if i.find('Penalty') != -1:
          declined_penalties.append(i)

        # Injuries
        injury_on_play = re.findall(injury_pattern, i)
        if len(injury_on_play) > 0:
          injured_players.append(injury_on_play[0])

      if len(accepted_penalties) > 0:
        df_plays.at[idx, 'AcceptedPenalty'] = accepted_penalties
      if len(declined_penalties) > 0:
        df_plays.at[idx, 'DeclinedPenalty'] = declined_penalties
      if len(injured_players) > 0:
        df_plays.at[idx, 'InjuredPlayers'] = injured_players

    # Time of play
    time_on_clock = re.findall(time_on_clock_pattern, play)
    if len(time_on_clock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = time_on_clock[0]

    #########################
    # FIELD GOAL SITUATIONS #
    #########################

    # Field goal good
    field_goal_good = re.findall(field_goal_good_pattern, play)
    if len(field_goal_good) > 0:
      df_plays.loc[idx, 'PlayOutcome'] = 'Field Goal Good'
      df_plays.loc[idx, 'PlayType'] = 'Field Goal'
      df_plays.loc[idx, 'Kicker'] = field_goal_good[0][0]
      df_plays.loc[idx, 'Yardage'] = int(field_goal_good[0][1])
      df_plays.loc[idx, 'LongSnapper'] = field_goal_good[0][2]
      df_plays.loc[idx, 'Holder'] = field_goal_good[0][3]
      continue

    # Field goal no good
    field_goal_no_good = re.findall(field_goal_no_good_pattern, play)
    if len(field_goal_no_good) > 0:
      df_plays.loc[idx, 'PlayOutcome'] = 'Field Goal No Good'
      df_plays.loc[idx, 'PlayType'] = 'Field Goal'
      df_plays.loc[idx, 'Kicker'] = field_goal_no_good[0][0]
      df_plays.loc[idx, 'Yardage'] = int(field_goal_no_good[0][1])
      df_plays.loc[idx, 'Direction'] = field_goal_no_good[0][2]
      df_plays.loc[idx, 'LongSnapper'] = field_goal_no_good[0][3]
      df_plays.loc[idx, 'Holder'] = field_goal_no_good[0][4]
      continue

    # - Going to treat this as a fumble. Will use helper method 'extract_fumble_data'
    # - Should I create a feature for those who recovered the ball?

    # Field goal blocked
    # I NEED A LARGER SAMPLE SIZE TO CORRECTLY CLEAN THESE
    field_goal_blocked = re.findall(field_goal_blocked_pattern, play)
    if len(field_goal_blocked) > 0:

      # What should I do here?
      # - Should I clean the main play here and send everything else
      #   to the fumble helper method?
      #   - I think I am going to try this. I think it will work.
      # - I will have to grab all sentences leading up to main field goal sentence.
      # - I will have to grab all sentences after and send them to the fumble helper method.
      #   - main_cleaning_method parameter will be the clean_run_plays.

      # 1. check to see if play has more than 1 sentence.
      # 2. create if statement (if more than 1):
      #    a. find where initial field goal attempt is
      #    b. wrap all sentences following initial field goal attempt together.
      #    c. clean wrapped sentence together using 'extract_fumble_data'
      #       - use 'clean_run_plays' method for this.
      #    NOTE: at this point, we should now have the back half of the play cleaned
      #          and inside a df. We now need to clean the front half and attach to
      #          the back half and replace original play df row with this new df set of rows.
      #           - Maybe I can replace the contents of the original row and add the cleaned back half?
      #             - I like this idea.

      # STEPS:
      # 1. if play is multiple sentences (after field goal attempt?)
      #    a. find all sentences following the sentence explaining field goal attempt.
      #    b. clean back half of play
      #    c. add cleaned back half to original df of plays (after original play)
      # 2. clean first half of play (replace original play row)
      #    - I think I have to clean first half of play first.

      # I think I will only use "extract fumble data" if there is actually a fumble
      # - I think I should use "clean_fumble_run" data to clean a normal recovery for yardage or touchdown.

      # GOALS:
      # - Clean normal attempted field goal that was blocked
      # - create split for recovery for yardage
      #   - regular yardage
      #   - yardage for touchdown
      # - check for fumbles here?

      # STEPS:
      # 1. separate play
      #    a. field goal attempt that was blocked
      #       - I think I may want extra data here? such as penalties and injuries?
      #    b. everything that follows after (should only be 1 sentence without fumble or handoff)
      # 2. if there is more than the field goal attempt:
      #    - isolate field goal attempt
      #      - clean field goal attempt here?
      #    - clean yardage after recovery
      #      - should i look for fumbles here?
      #    - combine cleaned actions
      #    - put into original plays dataframe
      #    - recursion from there
      # 3. clean field goal attempt

      # Need to locate field goal attempt within play description
      play_elements = play.split(". ")
      if len(play_elements) > 1:
        for i in play_elements:
          # Locating which sentence contains the field goal attempt
          field_goal_blocked = re.findall(field_goal_blocked_pattern, i)
          if len(field_goal_blocked) > 0:
            # Grabbing field goal attempt that was blocked
            field_goal_attempt = i

            # Grabbing all actions that followed the field goal attempt (should be things such as recovery for yardage, fumbles, recovery for touchdown, etc.)
            field_goal_blocked_recovery_actions = play_elements[play_elements.index(i)+1::]
            field_goal_blocked_recovery_actions = ". ".join(field_goal_blocked_recovery_actions)

            # create empty dataframe with recovery data as 'PlayDescription'
            # - probably should make this a copy of the original play to keep consistent
            df_recovery_yardage_rows = df_plays.loc[idx].copy()
            df_recovery_yardage_rows['PlayDescription'] = field_goal_blocked_recovery_actions
            df_recovery_yardage_rows['PlayOutcome'] = 'Run'
            df_recovery_yardage_rows = clean_run_plays(df_recovery_yardage_rows)


            # LEFT OFF HERE.


            # combine cleaned actions
            # put into original plays dataframe
            # recursion
            df_before = df_plays.iloc[:df_plays.index.tolist().index(idx)+1]
            df_after = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
            df_plays = pd.concat([df_before, df_add_on_rows, df_after], ignore_index=True)
            index_of_last_added_row = idx + len(df_add_on_rows)

            if df_field_goal_plays.tail(1).index.tolist()[0] == idx:
              return df_plays
            else:
              return clean_field_goal_plays(df_plays, index_of_last_added_row + 1)

            break

      # if play.lower().find('recovered') != -1:
      #   main_action_patterns = [field_goal_blocked_pattern, defensive_takeaway_run_pattern, touchdown_after_takeaway_pattern]
      #   main_cleaning_method = clean_field_goal_plays
      #   df_replacement_rows = extract_fumble_data(df_plays, play, idx,
      #                                             main_action_patterns,
      #                                             main_cleaning_method)


      #   # Use this when initial play and following actions have been cleaned.
      #   df_before = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      #   df_after = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      #   df_plays = pd.concat([df_before, df_replacement_rows, df_after], ignore_index=True)
      #   index_of_last_added_row = idx + len(df_replacement_rows) - 1

        if df_field_goal_plays.tail(1).index.tolist()[0] == idx:
          return df_plays
        else:
          return clean_field_goal_plays(df_plays, index_of_last_added_row + 1)

      df_plays.loc[idx, 'PlayOutcome'] = 'Field Goal Blocked'
      df_plays.loc[idx, 'PlayType'] = 'Field Goal'
      df_plays.loc[idx, 'Kicker'] = field_goal_blocked[0][0]
      df_plays.loc[idx, 'Yardage'] = int(field_goal_blocked[0][1])
      df_plays.loc[idx, 'BlockedBy'] = field_goal_blocked[0][2]
      df_plays.loc[idx, 'LongSnapper'] = field_goal_blocked[0][3]
      df_plays.loc[idx, 'Holder'] = field_goal_blocked[0][4]
      continue

  return df_plays

####EXTRA POINT

In [32]:
def clean_extra_point_plays(df_plays, index_start = None):

  # Adjusting df_plays to start cleaning at a specified index (index_start)
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    # Locating all extra point plays within dataframe
    df_extra_point_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Extra Point')]
  else:
    # Locating all extra point plays within dataframe
    df_field_goal_plays = df_plays[df_plays['PlayOutcome'].str.contains('Extra Point')]

  for idx, play in df_field_goal_plays['PlayDescription'].items():

    play_elements = play.split(". ")

    ###################
    # EXTRA PLAY DATA #
    ###################

    if len(play_elements) > 1:

      accepted_penalties = []
      declined_penalties = []
      injured_players = []

      for i in play_elements:

        # Accepted Penalty
        if i.find('PENALTY') != -1:
          accepted_penalties.append(i)

        # Declined Penalty
        if i.find('Penalty') != -1:
          declined_penalties.append(i)

        # Injuries
        injury_on_play = re.findall(injury_pattern, i)
        if len(injury_on_play) > 0:
          injured_players.append(injury_on_play[0])

      if len(accepted_penalties) > 0:
        df_plays.at[idx, 'AcceptedPenalty'] = accepted_penalties
      if len(declined_penalties) > 0:
        df_plays.at[idx, 'DeclinedPenalty'] = declined_penalties
      if len(injured_players) > 0:
        df_plays.at[idx, 'InjuredPlayers'] = injured_players

    ##########################
    # EXTRA POINT SITUATIONS #
    ##########################

    # Extra point good
    extra_point_good = re.findall(extra_point_good_pattern, play)
    if len(extra_point_good) > 0:
      df_plays.loc[idx, 'PlayOutcome'] = 'Extra Point Good'
      df_plays.loc[idx, 'PlayType'] = 'Extra Point'
      df_plays.loc[idx, 'Kicker'] = extra_point_good[0][0]
      df_plays.loc[idx, 'LongSnapper'] = extra_point_good[0][1]
      df_plays.loc[idx, 'Holder'] = extra_point_good[0][2]
      continue

    # Extra point no good
    extra_point_no_good = re.findall(extra_point_no_good_pattern, play)
    if len(extra_point_no_good) > 0:
      df_plays.loc[idx, 'PlayOutcome'] = 'Extra Point No Good'
      df_plays.loc[idx, 'PlayType'] = 'Extra Point'
      df_plays.loc[idx, 'Kicker'] = extra_point_no_good[0][0]
      df_plays.loc[idx, 'Direction'] = extra_point_no_good[0][1]
      df_plays.loc[idx, 'LongSnapper'] = extra_point_no_good[0][2]
      df_plays.loc[idx, 'Holder'] = extra_point_no_good[0][3]
      continue

  return df_plays

###OTHER CLEANING METHODS

#### FUMBLE PLAYS

In [33]:
# What about punt returns?
# Might need more data on 'Aborted' fumbled plays. Currently it does not show who fumbled the ball.

def clean_fumble_plays(df_plays, index_start=None):

  # Cut 'df_plays' to begin from 'index_start' to the last penalty play available in dataframe
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_fumble_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('fumble', case=False)]
  else:
    df_fumble_plays = df_plays[df_plays['PlayOutcome'].str.contains('fumble', case=False)]

  for idx, play in df_fumble_plays['PlayDescription'].items():

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    initial_action = play.split(". ")[0]

    ##################
    # PASSING FUMBLE #
    ##################

    fumble_pass = re.findall(receiver_pattern, initial_action)
    if len(fumble_pass) > 0:

      # creating a copy of the passing fumbled play row and cleaning the copy
      passing_fumble_row = df_plays.loc[idx].copy()
      passing_fumble_row['PlayOutcome'] = 'Pass'
      passing_fumble_row = pd.DataFrame([passing_fumble_row], columns=df_plays.columns)
      passing_fumble_row = clean_pass_plays(passing_fumble_row)
      # passing_fumble_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Record whether the pass was complete or incomplete.
      if play.find('pass incomplete') != -1:
        passing_fumble_row['PlayOutcome'] = f"{df_plays['PlayOutcome'].loc[idx]} (I)"
      else:
        passing_fumble_row['PlayOutcome'] = f"{df_plays['PlayOutcome'].loc[idx]} (C)"

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, passing_fumble_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_fumble_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_fumble_plays(df_plays, idx+len(passing_fumble_row))

    ##################
    # RUSHING FUMBLE #
    ##################

    fumble_rush = re.findall(rusher_pattern, initial_action)
    qb_fumble = re.findall(qb_fumble_pattern, initial_action)
    fumble_aborted = initial_action.find('Aborted')
    if len(fumble_rush) > 0 or fumble_aborted != -1 or len(qb_fumble) > 0:

      # creating a copy of the rushing fumbled play row and cleaning the copy
      rushing_fumble_row = df_plays.loc[idx].copy()
      rushing_fumble_row['PlayOutcome'] = 'Run'
      rushing_fumble_row = pd.DataFrame([rushing_fumble_row], columns=df_plays.columns)
      rushing_fumble_row = clean_run_plays(rushing_fumble_row)
      rushing_fumble_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, rushing_fumble_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_fumble_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_fumble_plays(df_plays, idx+len(rushing_fumble_row))

    #################
    # SACKED FUMBLE #
    #################

    if initial_action.find('sacked') != -1:

      # creating a copy of the sacked fumble play row and cleaning the copy
      sacked_fumble_row = df_plays.loc[idx].copy()
      sacked_fumble_row['PlayOutcome'] = 'Sack'
      sacked_fumble_row = pd.DataFrame([sacked_fumble_row], columns=df_plays.columns)
      sacked_fumble_row = clean_sacked_plays(sacked_fumble_row)
      sacked_fumble_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, sacked_fumble_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_fumble_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_fumble_plays(df_plays, idx+len(sacked_fumble_row))

    ##################
    # KICKOFF FUMBLE #
    ##################

    kickoff_fumble = re.findall(kickoff_pattern, initial_action)
    if len(kickoff_fumble) > 0:

      # creating a copy of the passing fumbled play row and cleaning the copy
      kickoff_fumble_row = df_plays.loc[idx].copy()
      kickoff_fumble_row['PlayOutcome'] = 'kickoff'
      kickoff_fumble_row = pd.DataFrame([kickoff_fumble_row], columns=df_plays.columns)
      kickoff_fumble_row = clean_kickoff_plays(kickoff_fumble_row)
      kickoff_fumble_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, kickoff_fumble_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_fumble_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_fumble_plays(df_plays, idx+len(kickoff_fumble_row))

  return df_plays

#### PENALTY PLAYS

In [34]:
# This probably does not cover every possible penalty play.
# For example, in this sample of plays there are no penalties during kickoffs
# when penalties during kickoffs are 100% possible.

def clean_penalty_plays(df_plays, index_start=None):

  # Cut 'df_plays' to begin from 'index_start' to the last penalty play available in dataframe
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_plays_adjusted = df_plays.iloc[df_plays.index.tolist().index(index_start):]
    df_penalty_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('penalty', case=False)]
  else:
    df_penalty_plays = df_plays[df_plays['PlayOutcome'].str.contains('penalty', case=False)]

  # Iterating through every penalty play within 'df_penalty_plays'
  for idx, play in df_penalty_plays['PlayDescription'].items():

    ############
    # REVERSES #
    ############

    # In 'PlayDescription' all information before the "reversed" sentence is not needed.
    # - All information before is stored within 'ReverseDetails' and the remaining is cleaned.
    if play.find('REVERSED') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find("REVERSED") != -1:
          df_plays.at[idx, 'ReverseDetails'] = play_elements[:play_elements.index(i) + 1]
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    ############################
    # REPORTING IN AS ELIGIBLE #
    ############################

    # I do not think this contains any useful data so I am going to exclude it.
    if play.find('reported in as eligible') != -1:
      play_elements = play.split(". ")
      for i in play_elements:
        if i.find('reported in as eligible') != -1:
          play = ". ".join(play_elements[play_elements.index(i) + 1:])
          break

    initial_action = play.split(". ")[0]

    ###############################
    # PENALTY DURING PASSING PLAY #
    ###############################

    penalty_pass = re.findall(receiver_pattern, initial_action)
    if len(penalty_pass) > 0 or play.find('pass incomplete') != -1:

      # creating a copy of the passing penalty play row and cleaning the copy
      passing_penalty_row = df_plays.loc[idx].copy()
      passing_penalty_row['PlayOutcome'] = 'Pass'
      passing_penalty_row = pd.DataFrame([passing_penalty_row], columns=df_plays.columns)
      passing_penalty_row = clean_pass_plays(passing_penalty_row)
      passing_penalty_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]
      passing_penalty_row['PlayType'] = 'No Play'

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, passing_penalty_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_penalty_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_penalty_plays(df_plays, idx+len(passing_penalty_row))

    ###############################
    # PENALTY DURING RUSHING PLAY #
    ###############################

    penalty_rush = re.findall(rusher_pattern, initial_action)
    if len(penalty_rush) > 0 or play.find('Aborted') != -1:

      # creating a copy of the rushing penalty play row and cleaning the copy
      rushing_penalty_row = df_plays.loc[idx].copy()
      rushing_penalty_row['PlayOutcome'] = 'Run'
      rushing_penalty_row = pd.DataFrame([rushing_penalty_row], columns=df_plays.columns)
      rushing_penalty_row = clean_run_plays(rushing_penalty_row)
      rushing_penalty_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]
      rushing_penalty_row['PlayType'] = 'No Play'

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, rushing_penalty_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_penalty_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_penalty_plays(df_plays, idx+len(rushing_penalty_row))

    ######################################
    # PENALTY DURING 2PT CONVERSION PLAY #
    ######################################

    if play.find('TWO-POINT CONVERSION ATTEMPT') != -1:

      # creating a copy of the 2pt conversion penalty play row and cleaning the copy
      two_pt_conversion_penalty_row = df_plays.loc[idx].copy()
      two_pt_conversion_penalty_row['PlayOutcome'] = '2PT Conversion'
      two_pt_conversion_penalty_row = pd.DataFrame([two_pt_conversion_penalty_row], columns=df_plays.columns)
      two_pt_conversion_penalty_row = cleaning_2pt_conversion_plays(two_pt_conversion_penalty_row)
      two_pt_conversion_penalty_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]
      two_pt_conversion_penalty_row['PlayType'] = 'No Play'

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, two_pt_conversion_penalty_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_penalty_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_penalty_plays(df_plays, idx+len(two_pt_conversion_penalty_row))

    #################
    # SACKED FUMBLE #
    #################

    if initial_action.find('sacked') != -1:

      # creating a copy of the sacked fumble play row and cleaning the copy
      sacked_penalty_row = df_plays.loc[idx].copy()
      sacked_penalty_row['PlayOutcome'] = 'Sack'
      sacked_penalty_row = pd.DataFrame([sacked_penalty_row], columns=df_plays.columns)
      sacked_penalty_row = clean_sacked_plays(sacked_penalty_row)
      sacked_penalty_row['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]
      sacked_penalty_row['PlayType'] = 'No Play'

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, sacked_penalty_row, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_penalty_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_penalty_plays(df_plays, idx+len(sacked_penalty_row))

    #########################
    # PENALTY (False Start) #
    #########################

    # Will use 'clean_run_plays' method to clean these
    # All other penalty plays (e.i. False Start, Delay of Game, Offside, Neutral Zone Infraction, Too Many Men on Field, Encroachment, Taunting)

    # if play.find('False Start') != -1 or play.find('Delay of Game') != -1:

    # TimeOnTheClock
    TimeOnTheClock = re.findall(time_on_clock_pattern, play)
    if len(TimeOnTheClock) > 0:
      df_plays.loc[idx, 'TimeOnTheClock'] = TimeOnTheClock[0]

    # Formation
    Formation = re.findall(formation, play)
    if len(Formation) > 0:
      if Formation[0] == 'Aborted':
        pass
      else:
        df_plays.loc[idx, 'Formation'] = Formation[0]

    df_plays.at[idx, 'AcceptedPenalty'] = play
    df_plays.at[idx, 'PlayType'] = 'No Play'

  return df_plays

#### TURNOVER ON DOWNS

In [35]:
# Looks like either a pass / run / sack play

def clean_turnover_on_downs_plays(df_plays, index_start=None):

  # Cut 'df_plays' to begin from 'index_start' to the last penalty play available in dataframe
  if index_start != None:
    df_plays_adjusted = df_plays.loc[index_start:]
    df_turnover_on_downs_plays = df_plays_adjusted[df_plays_adjusted['PlayOutcome'].str.contains('Turnover on Downs', case=False)]
  else:
    df_turnover_on_downs_plays = df_plays[df_plays['PlayOutcome'].str.contains('Turnover on Downs', case=False)]

  # Iterating through every penalty play within 'df_turnover_on_downs_plays'
  for idx, play in df_turnover_on_downs_plays['PlayDescription'].items():

    ############################
    # TURNOVER ON DOWNS (PASS) #
    ############################

    passing_play = re.findall(passer_name_pattern, play)
    if len(passing_play) > 0 and play.find("sacked") == -1:

      passing_turnover_on_downs = df_plays.loc[idx].copy()
      passing_turnover_on_downs['PlayOutcome'] = 'Pass'
      passing_turnover_on_downs = pd.DataFrame([passing_turnover_on_downs], columns=df_plays.columns)
      passing_turnover_on_downs = clean_pass_plays(passing_turnover_on_downs)

      # Record whether the pass was complete or incomplete.
      if play.find('pass incomplete') != -1:
        passing_turnover_on_downs['PlayOutcome'] = f"{df_plays['PlayOutcome'].loc[idx]} (I)"
      else:
        passing_turnover_on_downs['PlayOutcome'] = f"{df_plays['PlayOutcome'].loc[idx]} (C)"

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, passing_turnover_on_downs, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_turnover_on_downs_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_turnover_on_downs_plays(df_plays, idx+len(passing_turnover_on_downs))

    ############################
    # TURNOVER ON DOWNS (RUSH) #
    ############################

    rushing_play = re.findall(rusher_pattern, play)
    if len(rushing_play) > 0:

      rushing_turnover_on_downs = df_plays.loc[idx].copy()
      rushing_turnover_on_downs['PlayOutcome'] = 'Run'
      rushing_turnover_on_downs = pd.DataFrame([rushing_turnover_on_downs], columns=df_plays.columns)
      rushing_turnover_on_downs = clean_run_plays(rushing_turnover_on_downs)
      rushing_turnover_on_downs['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, rushing_turnover_on_downs, df_after_row], ignore_index=True)

      if df_turnover_on_downs_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_turnover_on_downs_plays(df_plays, idx+len(rushing_turnover_on_downs))

    ##############################
    # TURNOVER ON DOWNS (SACKED) #
    ##############################

    if play.find("sacked") != -1:

      sacked_turnover_on_downs = df_plays.loc[idx].copy()
      sacked_turnover_on_downs['PlayOutcome'] = 'Sack'
      sacked_turnover_on_downs = pd.DataFrame([sacked_turnover_on_downs], columns=df_plays.columns)
      sacked_turnover_on_downs.reset_index(drop=True, inplace=True)
      sacked_turnover_on_downs = clean_sacked_plays(sacked_turnover_on_downs)
      sacked_turnover_on_downs['PlayOutcome'] = df_plays['PlayOutcome'].loc[idx]

      # Replacing old row with cleaned row
      df_before_row = df_plays.iloc[:df_plays.index.tolist().index(idx)]
      df_after_row = df_plays.iloc[df_plays.index.tolist().index(idx)+1:]
      df_plays = pd.concat([df_before_row, sacked_turnover_on_downs, df_after_row], ignore_index=True)

      # Recursion to update 'df_plays'
      if df_turnover_on_downs_plays.tail(1).index.tolist()[0] == idx:
        return df_plays
      else:
        return clean_turnover_on_downs_plays(df_plays, idx+len(sacked_turnover_on_downs))

## 4. PIPELINE MAIN METHOD

In [36]:
# PURPOSE:
# - Accept a dataframe of plays (dataframes formatted by NFL_Scrapers) and
#   return a cleaned dataframe of those plays.
# INPUT PARAMTERS:
# df_all_plays         - dataframe - all plays in raw form from NFL_Scraper that user
#                                    would like to clean.
# OUTPUT:
# df_all_plays_cleaned - dataframe - all plays from 'df_all_plays' cleaned and data
#                                    dispersed into individual new features.

# CURRENT DESIGN PLAN:
# 1. Use uniquely designed methods for each play type to clean within dataframe
#    - (e.g. pass, run, touchdown, punt, sack, ... )
# 2. Repeat until all plays within dataframe have been cleaned.
#   NOTE:
#   - It is important to fully clean a play type before moving to the next
#      because sometimes cleaning could involve adding a new row to the dataframe,
#      causing a reset to the dataframes indexing.
#      - If we were to separate all play types from the beginning, the indexes
#        could shift around causing, for example, an index that might originally
#        point to a run play to now instead point at a pass play.

# NOTES:
# - I think "PlayOutcomes" is what determines the yardage gained on an intended play?
#   - This does not seem right to me.
#   - EXAMPLE:
#     - (9:54) Bre.Hall left end to BUF 22 for -1 yards (G.Rousseau)
#       FUMBLES (G.Rousseau), ball out of bounds at BUF 25.
#       - I would think that Bre.Hall would get docked -1 yards for his run.
#         - But I believe that he is actually docked -4
#           - 'PlayStart' = 2nd & 9 at BUF 21
#           - The play ends at BUF 25
#             - In my opinion and how I am going to track yardage is based on
#               possession of the ball. So I will track this as -1 yard not -4.

def clean_dataframe_of_plays(df_all_plays):

  ################################
  # RAW DATA COLUMN DESCRIPTIONS #
  ################################
  # TeamWithPossession - Team that STARTED with the ball during the play. (The team that was on offense)

  ###########################
  # NEW COLUMN DESCRIPTIONS #
  ###########################

  # PlayType           - The type of play (e.g. pass/run)
  # TimeOnTheClock     - The time that was on the clock when the play started
  # Formation          - Play formation
  # Passer             - Player that threw the ball (mostly the quarterback)
  # Rusher             - Player that ran the ball (mostly the runningback)
  # Receiver           - Player on the same team as the passer that caught the ball
  # PassType           - Whether the pass was a deep or short pass?
  # Direction          - Where the ball is going during the play
  # Yardage            - Yards gained during the play
  # TackleBy1          - Main tackler on the play (could be solo or could be with someone else)
  # TackleBy2          - Assisted tackler1
  # PressureBy         - Defender that applied pressure to the passer
  # InterceptedBy      - Defender that intercepted the passing play
  # FumbleDetails      - A list that has what happened after the fumble
  #                      - [forced fumble by, recovered by, yards gained, tackled by]
  # ReverseDetails     - A list having plays leading up to play reversal
  # InjuredPlayers     - Players that were injured during the play
  # PenaltyDescription - If there is a penalty, gives a description of it
  #                      - [who caused the penalty, what was the penalty, yards lost if penalty accepted]

  # new_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "Direction", "Yardage",
  #               "TackleBy1", "TackleBy2", "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
  #               "FumbleDetails", "ReverseDetails",
  #               "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
  #               "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

  # string_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "Direction",
  #                   "TackleBy1", "TackleBy2", "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
  #                   "FumbleDetails", "ReverseDetails",
  #                   "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
  #                   "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

  # int_columns = ["Yardage"]


  # new_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "Direction", "Yardage",
  #               "SoloTackle", "AssistedTackle", "SharedTackle", "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
  #               "FumbleDetails", "ReverseDetails",
  #               "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
  #               "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

  new_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "Direction", "Yardage",
                "SoloTackle", "AssistedTackle", "SharedTackle", "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
                "FumbleDetails", "ReverseDetails",
                "InjuredPlayers", "OffensivePenalty", "DefensivePenalty",
                "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

  # string_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "Direction",
  #                   "SoloTackle", "AssistedTackle", "SharedTackle", "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
  #                   "FumbleDetails", "ReverseDetails",
  #                   "InjuredPlayers", "AcceptedPenalty", "DeclinedPenalty",
  #                   "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

  string_columns = ["PlayType", "TimeOnTheClock", "Formation", "Passer", "Rusher", "Receiver", "Direction",
                    "SoloTackle", "AssistedTackle", "SharedTackle", "PressureBy", "InterceptedBy", "SackedBy", "ForcedFumbleBy",
                    "FumbleDetails", "ReverseDetails",
                    "InjuredPlayers", "OffensivePenalty", "DefensivePenalty",
                    "Kicker", "LongSnapper", "Returner", "DownedBy", "Holder", "BlockedBy"]

  int_columns = ["Yardage"]

  ########################################
  # RETURN DATAFRAME WITH ADDED FEATURES #
  ########################################

  df_all_plays_cleaned = df_all_plays.copy()
  df_all_plays_cleaned = df_all_plays_cleaned.reindex(columns=df_all_plays_cleaned.columns.tolist() + new_columns)
  df_all_plays_cleaned[string_columns] = df_all_plays_cleaned[string_columns].astype(str)
  df_all_plays_cleaned[int_columns] = df_all_plays_cleaned[int_columns].astype(float)

  ########################################
  # GETTING PLAY CATEGORIES AND CLEANING #
  ########################################

  # TOUCHDOWNS MUST BE CLEANED FIRST
  # - Any touchdown resulting from a change in possession (e.g. Interception for Touchdown)
  #   raw data states that the team on defense had possession the entire drive.
  #   - So all plays leading up to the touchdown state that the defense has possession.
  df_all_plays_cleaned = clean_touchdown_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_run_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_pass_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = cleaning_2pt_conversion_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_intercepted_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_sacked_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_punt_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_kickoff_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_field_goal_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_extra_point_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_fumble_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_penalty_plays(df_all_plays_cleaned)
  # df_all_plays_cleaned = clean_turnover_on_downs_plays(df_all_plays_cleaned)

  return df_all_plays_cleaned

# TESTING (Helper Methods)

In [37]:
# PURPOSE:
# - A tool that can be used to compare original plays and their cleaned versions

# I would like to return a map that has:
# KEY: index of original unclean play
# VALUE: index(es) of cleaned play

def unclean_clean_matches(df_unclean_plays, df_clean_plays):

  my_map = {}

  # This group of features is unique to each play
  # - Both the unclean and cleaned versions of the plays have these
  # - These features will be used to find the matching plays between the unclean df and the cleaned df
  matching_features = ['Season', 'Week', 'Date', 'AwayTeam', 'HomeTeam', 'Quarter', 'DriveNumber', 'TeamWithPossession', 'PlayNumberInDrive']

  # Iterate through each row of the dataframe of unclean plays
  for u_row in df_unclean_plays.itertuples(index=True):
    u_features = [getattr(u_row, col) for col in matching_features]

    matching_indexes = []
    matches_found = False

    # Iterate through each row of the dataframe of cleaned plays
    # - The starting index will be the index of the unclean play within the main original dataframe of plays
    #   - The matching cleaned pair will either be at the exact same location or higher
    for c_row in df_clean_plays[u_row.Index::].itertuples(index=True):
      c_features = [getattr(c_row, col) for col in matching_features]

      # If a match is found, check for consective rows of matches because some uncleaned plays needed to be cleaned using multiple rows
      # - Once a row that does not match follows one that does, will break the loop because the one play match has been found.
      if u_features == c_features:
        matching_indexes.append(c_row.Index)
        matches_found = True
      elif matches_found:
        my_map[u_row.Index] = matching_indexes
        break

  return my_map

# TESTING AREA

In [74]:
df_week1_plays_cleaned = clean_dataframe_of_plays(week1_2023_plays_modified)

832


TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid

In [None]:
df_week1_plays_cleaned.shape

## PLAYTYPE OBSERVATIONS
- Looking at each play from each playtype

### Passing plays

In [None]:
# Number of passing type plays during 2023, Week 1

df_unclean_pass_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('Pass')]

map_passing_plays = unclean_clean_matches(df_unclean_pass_plays, df_week1_plays_cleaned)

len(map_passing_plays.keys())

In [None]:
# Every unclean passing play and their associated cleaned play breakdown

for i in map_passing_plays.keys():
  print(f"({i}, {map_passing_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# passing type plays during 2023, Week 1 that have been spiked

df_unclean_pass_plays_spiked = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayOutcome'].str.contains('Pass')) &
                                                             (week1_2023_plays_modified['PlayDescription'].str.contains('spiked', case=False))]

map_passing_spiked_plays = unclean_clean_matches(df_unclean_pass_plays_spiked, df_week1_plays_cleaned)

for i in map_passing_spiked_plays.keys():
  print(f"({i}, {map_passing_spiked_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# passing type plays during 2023, Week 1 that result in touchdown

df_unclean_pass_plays_touchdown = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayOutcome'].str.contains('touchdown', case=False)) &
                                                                (week1_2023_plays_modified['PlayDescription'].str.contains('pass', case=False))]

map_passing_touchdown_plays = unclean_clean_matches(df_unclean_pass_plays_touchdown, df_week1_plays_cleaned)

for i in map_passing_touchdown_plays.keys():
  print(f"({i}, {map_passing_touchdown_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# passing type plays during 2023, Week 1 that result in touchdown

df_unclean_pass_plays_touchdown = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayOutcome'].str.contains('touchdown', case=False)) &
                                                                (week1_2023_plays_modified['PlayDescription'].str.contains('PENALTY', case=False))]

map_passing_touchdown_plays = unclean_clean_matches(df_unclean_pass_plays_touchdown, df_week1_plays_cleaned)

for i in map_passing_touchdown_plays.keys():
  print(f"({i}, {map_passing_touchdown_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# every passing play that resulted in a fumble (including fumble recoveries resulting in a touchdown)

df_unclean_pass_fumble_plays = week1_2023_plays_modified.loc[((week1_2023_plays_modified['PlayOutcome'].str.contains('Pass')) |
                                                             ((week1_2023_plays_modified['PlayDescription'].str.contains('Touchdown', case=False)) &
                                                              (week1_2023_plays_modified['PlayOutcome'].str.contains('Pass')))) &
                                                              (week1_2023_plays_modified['PlayDescription'].str.contains('fumbles', case=False))]

for i in unclean_clean_matches(df_unclean_pass_fumble_plays, df_week1_plays_cleaned).items():
  print(i)

In [None]:
dict_unclean_to_clean_pass_fumble_plays = unclean_clean_matches(df_unclean_pass_fumble_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_pass_fumble_plays.keys():
  # print(i)
  print(f"({i}, {dict_unclean_to_clean_pass_fumble_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Rushing plays

In [None]:
# Number of running type plays during 2023, Week 1

df_unclean_run_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('Run')]

map_run_plays = unclean_clean_matches(df_unclean_run_plays, df_week1_plays_cleaned)

len(map_run_plays.keys())

In [None]:
# Every unclean passing play and their associated cleaned play breakdown

for i in map_run_plays.keys():
  print(f"({i}, {map_run_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# penalty rushing plays

df_unclean_rush_penalty_plays = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayOutcome'].str.contains('Run')) &
                                                              (week1_2023_plays_modified['PlayDescription'].str.contains('penalty', case=False))]

dict_unclean_rush_penalty_plays = unclean_clean_matches(df_unclean_rush_penalty_plays, df_week1_plays_cleaned)

for i in dict_unclean_rush_penalty_plays.keys():
  print(f"({i}, {dict_unclean_rush_penalty_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# fumbled rushing plays (not including touchdowns)

df_unclean_rush_fumble_plays = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayOutcome'].str.contains('Run')) &
                                                             (week1_2023_plays_modified['PlayDescription'].str.contains('fumbles', case=False))]

for i in unclean_clean_matches(df_unclean_rush_fumble_plays, df_week1_plays_cleaned).items():
  print(i)

In [None]:
dict_unclean_to_clean_rush_fumble_plays = unclean_clean_matches(df_unclean_rush_fumble_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_rush_fumble_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_rush_fumble_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All rushing touchdowns

df_unclean_pass_plays_touchdown = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayOutcome'].str.contains('touchdown', case=False))]

list_all_touchdown_rushing_plays = []

for idx, play in df_unclean_pass_plays_touchdown['PlayDescription'].items():
  run_play = re.findall(rusher_pattern, play)
  if len(run_play) > 0:
    list_all_touchdown_rushing_plays.append(idx)

map_rushing_touchdown_plays = unclean_clean_matches(week1_2023_plays_modified.loc[list_all_touchdown_rushing_plays], df_week1_plays_cleaned)

for i in map_rushing_touchdown_plays.keys():
  print(f"({i}, {map_rushing_touchdown_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

###2pt Conversions

In [None]:
# All extra point plays

df_unclean_2pt_conversion_week1 = week1_2023_plays_modified[week1_2023_plays_modified['PlayOutcome'].str.contains('2PT Conversion')]

dict_unclean_to_clean_2ptc = unclean_clean_matches(df_unclean_2pt_conversion_week1, df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_2ptc)} number of 2pt conversion attempts")
print("\n\n")
for i in dict_unclean_to_clean_2ptc.keys():
  print(f"({i}, {dict_unclean_to_clean_2ptc.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All passing 2PT conversion attempts

index_pass_2ptc = []

for i in list(df_unclean_2pt_conversion_week1.index):
  pass_2ptc = re.findall(tp_conversion_pass_pattern, week1_2023_plays_modified['PlayDescription'].iloc[i])
  if len(pass_2ptc) > 0:
    index_pass_2ptc.append(i)

dict_unclean_to_clean_pass_2ptc = unclean_clean_matches(week1_2023_plays_modified.iloc[index_pass_2ptc], df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_pass_2ptc)} number of 2pt conversion pass attempts")
print("\n\n")
for i in dict_unclean_to_clean_pass_2ptc.keys():
  print(f"({i}, {dict_unclean_to_clean_pass_2ptc.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All rushing 2PT conversion attempts

index_rush_2ptc = []

for i in list(df_unclean_2pt_conversion_week1.index):
  rush_2ptc = re.findall(tp_conversion_rush_pattern, week1_2023_plays_modified['PlayDescription'].iloc[i])
  if len(rush_2ptc) > 0:
    index_rush_2ptc.append(i)

dict_unclean_to_clean_rush_2ptc = unclean_clean_matches(week1_2023_plays_modified.iloc[index_rush_2ptc], df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_rush_2ptc)} number of 2pt conversion attempts")
print("\n\n")
for i in dict_unclean_to_clean_rush_2ptc.keys():
  print(f"({i}, {dict_unclean_to_clean_rush_2ptc.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Intercepted plays

In [None]:
df_unclean_intercepted_plays = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayDescription'].str.contains('INTERCEPTED', case=False)) |
                                                             (week1_2023_plays_modified['PlayOutcome'].str.contains('Interception', case=False))]

dict_unclean_to_clean_intercepted_plays = unclean_clean_matches(df_unclean_intercepted_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_intercepted_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_intercepted_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All interceptions resulting in a touchdown

df_unclean_intercepted_touchdown_plays = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayDescription'].str.contains('INTERCEPTED', case=False)) &
                                                                       (week1_2023_plays_modified['PlayOutcome'].str.contains('touchdown', case=False))]

dict_unclean_to_clean_intercepted_touchdown_plays = unclean_clean_matches(df_unclean_intercepted_touchdown_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_intercepted_touchdown_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_intercepted_touchdown_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Sacked Plays

In [None]:
df_unclean_sacked_plays = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayOutcome'].str.contains('Sack', case=False))]

dict_unclean_to_clean_sacked_plays = unclean_clean_matches(df_unclean_sacked_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_sacked_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_sacked_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All sacked plays resulting in a touchdown

df_unclean_sacked_touchdown_plays = week1_2023_plays_modified.loc[(week1_2023_plays_modified['PlayOutcome'].str.contains('touchdown', case=False)) &
                                                                  (week1_2023_plays_modified['PlayDescription'].str.contains('sack', case=False))]

dict_unclean_to_clean_sacked_touchdown_plays = unclean_clean_matches(df_unclean_sacked_touchdown_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_sacked_touchdown_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_sacked_touchdown_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Punt Plays

In [None]:
df_unclean_punt_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayDescription'].str.contains('punts', case=False)]

dict_unclean_to_clean_punt_plays = unclean_clean_matches(df_unclean_punt_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_punt_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_punt_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All punt return touchdown plays

df_unclean_punt_touchdown_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayDescription'].str.contains('punts', case=False) &
                                                                week1_2023_plays_modified['PlayDescription'].str.contains('touchdown', case=False)]

dict_unclean_to_clean_punt_touchdown_plays = unclean_clean_matches(df_unclean_punt_touchdown_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_punt_touchdown_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_punt_touchdown_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Kickoffs

In [None]:
# All kickoff plays

df_unclean_kickoff_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('kickoff', case=False)]

dict_unclean_to_clean_kickoff_plays = unclean_clean_matches(df_unclean_kickoff_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_kickoff_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_kickoff_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All onside kicks

df_unclean_kickoff_onside_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('kickoff', case=False) &
                                                                week1_2023_plays_modified['PlayDescription'].str.contains('onside', case=False)]

dict_unclean_to_clean_kickoff_onside_plays = unclean_clean_matches(df_unclean_kickoff_onside_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_kickoff_onside_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_kickoff_onside_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Touchdown plays

In [None]:
# All touchdown plays

df_unclean_touchdown_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('touchdown', case=False)]

dict_unclean_to_clean_touchdown_plays = unclean_clean_matches(df_unclean_touchdown_plays, df_week1_plays_cleaned)

for i in dict_unclean_to_clean_touchdown_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_touchdown_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Field goals

In [None]:
# All field goal plays

df_unclean_fieldgoal_week1 = week1_2023_plays_modified[week1_2023_plays_modified['PlayOutcome'].str.contains('Field Goal')]

dict_unclean_to_clean_field_goal_plays = unclean_clean_matches(df_unclean_fieldgoal_week1, df_week1_plays_cleaned)

# Number of field goal plays
print(f"{len(dict_unclean_to_clean_field_goal_plays)} number of field goal plays")
print("\n\n")
for i in dict_unclean_to_clean_field_goal_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_field_goal_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All field goal plays (good)

made_field_goal_play_indexes = []

for i in list(df_2023_fieldgoal_week1.index):
  made_field_goal = re.findall(field_goal_good_pattern, week1_2023_plays_modified['PlayDescription'].iloc[i])
  if len(made_field_goal) > 0:
    made_field_goal_play_indexes.append(i)

dict_unclean_to_clean_good_field_goals = unclean_clean_matches(week1_2023_plays_modified.iloc[made_field_goal_play_indexes], df_week1_plays_cleaned)

# Number of field goal plays
print(f"{len(dict_unclean_to_clean_good_field_goals)} number of good field goal plays")
print("\n\n")
for i in dict_unclean_to_clean_good_field_goals.keys():
  print(f"({i}, {dict_unclean_to_clean_good_field_goals.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All field goal plays (no good)

no_good_field_goal_play_indexes = []

for i in list(df_2023_fieldgoal_week1.index):
  made_field_goal = re.findall(field_goal_no_good_pattern, week1_2023_plays_modified['PlayDescription'].iloc[i])
  if len(made_field_goal) > 0:
    no_good_field_goal_play_indexes.append(i)

dict_unclean_to_clean_no_good_field_goals = unclean_clean_matches(week1_2023_plays_modified.iloc[no_good_field_goal_play_indexes], df_week1_plays_cleaned)

# Number of field goal plays
print(f"{len(dict_unclean_to_clean_no_good_field_goals)} number of no good field goal plays")
print("\n\n")
for i in dict_unclean_to_clean_no_good_field_goals.keys():
  print(f"({i}, {dict_unclean_to_clean_no_good_field_goals.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All field goal plays (special)

special_field_goal_play_indexes = []

special_field_goal_play_indexes = list(df_2023_fieldgoal_week1.index)

for i in made_field_goal_play_indexes:
  special_field_goal_play_indexes.pop(special_field_goal_play_indexes.index(i))

for i in no_good_field_goal_play_indexes:
  special_field_goal_play_indexes.pop(special_field_goal_play_indexes.index(i))


dict_unclean_to_clean_special_field_goals = unclean_clean_matches(week1_2023_plays_modified.iloc[special_field_goal_play_indexes], df_week1_plays_cleaned)

# Number of field goal plays
print(f"{len(dict_unclean_to_clean_special_field_goals)} number of special field goal plays")
print("\n\n")
for i in dict_unclean_to_clean_special_field_goals.keys():
  print(f"({i}, {dict_unclean_to_clean_special_field_goals.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Extra Points

In [None]:
# All extra point plays

df_unclean_extrapoint_week1 = week1_2023_plays_modified[week1_2023_plays_modified['PlayOutcome'].str.contains('Extra Point')]

dict_unclean_to_clean_extrapoint = unclean_clean_matches(df_unclean_extrapoint_week1, df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_extrapoint)} number of field goal plays")
print("\n\n")
for i in dict_unclean_to_clean_extrapoint.keys():
  print(f"({i}, {dict_unclean_to_clean_extrapoint.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All extra point plays (good)

extra_point_good_index_list = []

for i in list(df_2023_extrapoint_week1.index):
  made_extra_point = re.findall(extra_point_good_pattern, week1_2023_plays_modified['PlayDescription'].iloc[i])
  if len(made_extra_point) > 0:
    extra_point_good_index_list.append(i)

dict_unclean_to_clean_extrapoint_good = unclean_clean_matches(week1_2023_plays_modified.iloc[extra_point_good_index_list], df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_extrapoint_good)} number of field goal plays")
print("\n\n")
for i in dict_unclean_to_clean_extrapoint_good.keys():
  print(f"({i}, {dict_unclean_to_clean_extrapoint_good.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All extra point plays (no good)

extra_point_no_good_index_list = []

for i in list(df_2023_extrapoint_week1.index):
  no_good_extra_point = re.findall(extra_point_no_good_pattern, week1_2023_plays_modified['PlayDescription'].iloc[i])
  if len(no_good_extra_point) > 0:
    extra_point_no_good_index_list.append(i)

dict_unclean_to_clean_extrapoint_no_good = unclean_clean_matches(week1_2023_plays_modified.iloc[extra_point_no_good_index_list], df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_extrapoint_no_good)} number of field goal plays")
print("\n\n")
for i in dict_unclean_to_clean_extrapoint_no_good.keys():
  print(f"({i}, {dict_unclean_to_clean_extrapoint_no_good.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# Blocked extra point?

### Fumbles

In [None]:
# All fumbled plays

df_unclean_fumble_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('fumble', case=False)]

dict_unclean_to_clean_fumble_plays = unclean_clean_matches(df_unclean_fumble_plays, df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_fumble_plays)} number of fumbled plays")
print("\n\n")
for i in dict_unclean_to_clean_fumble_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_fumble_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All passing fumble plays

index_fumble_pass_plays = []

for i in list(df_unclean_fumble_plays.index):
  fumble_pass = re.findall(receiver_pattern, week1_2023_plays_modified['PlayDescription'].iloc[i])
  if len(fumble_pass) > 0:
    index_fumble_pass_plays.append(i)

dict_unclean_to_clean_fumble_pass = unclean_clean_matches(week1_2023_plays_modified.iloc[index_fumble_pass_plays], df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_fumble_pass)} number of passing fumble plays")
print("\n\n")
for i in dict_unclean_to_clean_fumble_pass.keys():
  print(f"({i}, {dict_unclean_to_clean_fumble_pass.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All rushing fumble plays

index_fumble_run_plays = []

for i in list(df_unclean_fumble_plays.index):
  fumble_pass = re.findall(rusher_pattern, week1_2023_plays_modified['PlayDescription'].iloc[i])
  if len(fumble_pass) > 0:
    index_fumble_run_plays.append(i)

dict_unclean_to_clean_fumble_run = unclean_clean_matches(week1_2023_plays_modified.iloc[index_fumble_run_plays], df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_fumble_run)} number of rushing fumble plays")
print("\n\n")
for i in dict_unclean_to_clean_fumble_run.keys():
  print(f"({i}, {dict_unclean_to_clean_fumble_run.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All sacked fumble plays

index_fumble_sacked_plays = []

for i in list(df_unclean_fumble_plays.index):
  if week1_2023_plays_modified['PlayDescription'].iloc[i].find('sacked') != -1:
    index_fumble_sacked_plays.append(i)

dict_unclean_to_clean_sacked_fumble = unclean_clean_matches(week1_2023_plays_modified.iloc[index_fumble_sacked_plays], df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_sacked_fumble)} number of sacked fumble plays")
print("\n\n")
for i in dict_unclean_to_clean_sacked_fumble.keys():
  print(f"({i}, {dict_unclean_to_clean_sacked_fumble.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All Aborted fumbled plays

# week1_2023_plays['PlayOutcome'].str.contains('fumble', case=False)

index_fumble_aborted_plays = []

for i in list(df_unclean_fumble_plays.index):
  if week1_2023_plays_modified['PlayDescription'].iloc[i].find('Aborted') != -1:
    index_fumble_aborted_plays.append(i)

dict_unclean_to_clean_aborted_fumble = unclean_clean_matches(week1_2023_plays_modified.iloc[index_fumble_aborted_plays], df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_aborted_fumble)} number of aborted fumble plays")
print("\n\n")
for i in dict_unclean_to_clean_aborted_fumble.keys():
  print(f"({i}, {dict_unclean_to_clean_aborted_fumble.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All special fumbled plays

index_fumble_special_plays = list(df_unclean_fumble_plays.index)

for i in index_fumble_pass_plays:
  index_fumble_special_plays.pop(index_fumble_special_plays.index(i))

for i in index_fumble_run_plays:
  index_fumble_special_plays.pop(index_fumble_special_plays.index(i))

for i in index_fumble_sacked_plays:
  index_fumble_special_plays.pop(index_fumble_special_plays.index(i))

for i in index_fumble_aborted_plays:
  index_fumble_special_plays.pop(index_fumble_special_plays.index(i))

dict_unclean_to_clean_fumble_special = unclean_clean_matches(week1_2023_plays_modified.iloc[index_fumble_special_plays], df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_fumble_special)} number of special fumble plays")
print("\n\n")
for i in dict_unclean_to_clean_fumble_special.keys():
  print(f"({i}, {dict_unclean_to_clean_fumble_special.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Penalties

In [None]:
# What is the difference between these penalties and penalties in other play outcomes?

# All plays with "penalty" outcomes

df_unclean_penalty_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('penalty', case=False)]

dict_unclean_to_clean_penalty_plays = unclean_clean_matches(df_unclean_penalty_plays, df_week1_plays_cleaned)

# Number of penalty plays
print(f"{len(df_unclean_penalty_plays)} number of penalty plays")
print("\n\n")
for i in dict_unclean_to_clean_penalty_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_penalty_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All passing plays with "penalty" outcomes

# Grabbing all penalty plays within original dataframe
df_unclean_penalty_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('penalty', case=False)]

# List for all indexes that are passing play type penalty plays
list_unclean_penalty_passing_plays = []

for idx, play in df_unclean_penalty_plays['PlayDescription'].items():
  passing_play = re.findall(receiver_pattern, play)
  if len(passing_play) > 0 or play.find('pass incomplete') != -1:
    list_unclean_penalty_passing_plays.append(idx)

# Dataframe of all passing plays with "penalty" outcomes
df_unclean_penalty_passing_plays = week1_2023_plays_modified.iloc[list_unclean_penalty_passing_plays]

dict_unclean_to_clean_penalty_passing_plays = unclean_clean_matches(df_unclean_penalty_passing_plays, df_week1_plays_cleaned)

# Number of passing penalty plays
print(f"{len(list_unclean_penalty_passing_plays)} number of passing penalty plays")
print("\n\n")
for i in dict_unclean_to_clean_penalty_passing_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_penalty_passing_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All rushing plays with "penalty" outcomes

# Grabbing all penalty plays within original dataframe
df_unclean_penalty_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('penalty', case=False)]

# List for all indexes that are passing play type penalty plays
list_unclean_penalty_rushing_plays = []

for idx, play in df_unclean_penalty_plays['PlayDescription'].items():
  passing_play = re.findall(rusher_pattern, play)
  if len(passing_play) > 0 or play.find('Aborted') != -1:
    list_unclean_penalty_rushing_plays.append(idx)

# Dataframe of all passing plays with "penalty" outcomes
df_unclean_penalty_rushing_plays = week1_2023_plays_modified.iloc[list_unclean_penalty_rushing_plays]

dict_unclean_to_clean_penalty_rushing_plays = unclean_clean_matches(df_unclean_penalty_rushing_plays, df_week1_plays_cleaned)

# Number of rushing penalty plays
print(f"{len(list_unclean_penalty_rushing_plays)} number of passing penalty plays")
print("\n\n")
for i in dict_unclean_to_clean_penalty_rushing_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_penalty_rushing_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All "False Start" or "Delay of Game" plays with "penalty" outcomes

# Grabbing all penalty plays within original dataframe
df_unclean_penalty_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('penalty', case=False)]

# List for all indexes that are passing play type penalty plays
list_unclean_penalty_false_start_plays = []

for idx, play in df_unclean_penalty_plays['PlayDescription'].items():
  if play.find('False Start') != -1 or play.find('Delay of Game') != -1:
    list_unclean_penalty_false_start_plays.append(idx)

dict_unclean_to_clean_penalty_false_start_plays = unclean_clean_matches(week1_2023_plays_modified.iloc[list_unclean_penalty_false_start_plays], df_week1_plays_cleaned)

# Number of rushing penalty plays
print(f"{len(list_unclean_penalty_false_start_plays)} number of false start plays")
print("\n\n")
for i in dict_unclean_to_clean_penalty_false_start_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_penalty_false_start_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All sacked plays with "penalty" outcomes

# Grabbing all penalty plays within original dataframe
df_unclean_penalty_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('penalty', case=False)]

# List for all indexes that are passing play type penalty plays
list_unclean_penalty_sacked_plays = []

for idx, play in df_unclean_penalty_plays['PlayDescription'].items():
  if play.find('sacked') != -1:
    list_unclean_penalty_sacked_plays.append(idx)

dict_unclean_to_clean_penalty_sacked_plays = unclean_clean_matches(week1_2023_plays_modified.iloc[list_unclean_penalty_sacked_plays], df_week1_plays_cleaned)

# Number of rushing penalty plays
print(f"{len(list_unclean_penalty_sacked_plays)} number of false start plays")
print("\n\n")
for i in dict_unclean_to_clean_penalty_sacked_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_penalty_sacked_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All TWO-POINT CONVERSION ATTEMPT plays with "penalty" outcomes

# Grabbing all penalty plays within original dataframe
df_unclean_penalty_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('penalty', case=False)]

# List for all indexes that are passing play type penalty plays
list_unclean_penalty_2pt_plays = []

for idx, play in df_unclean_penalty_plays['PlayDescription'].items():
  if play.find('TWO-POINT CONVERSION ATTEMPT') != -1:
    list_unclean_penalty_2pt_plays.append(idx)

dict_unclean_to_clean_penalty_2pt_plays = unclean_clean_matches(week1_2023_plays_modified.iloc[list_unclean_penalty_2pt_plays], df_week1_plays_cleaned)

# Number of rushing penalty plays
print(f"{len(list_unclean_penalty_2pt_plays)} number of false start plays")
print("\n\n")
for i in dict_unclean_to_clean_penalty_2pt_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_penalty_2pt_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

In [None]:
# All special plays with "penalty" outcomes

# Grabbing all penalty plays within original dataframe
df_unclean_penalty_plays = week1_2023_plays_modified.loc[week1_2023_plays_modified['PlayOutcome'].str.contains('penalty', case=False)]

# List for all indexes that are passing play type penalty plays
list_unclean_penalty_special_plays = list(df_unclean_penalty_plays.index)

for i in list_unclean_penalty_passing_plays:
  list_unclean_penalty_special_plays.pop(list_unclean_penalty_special_plays.index(i))

for i in list_unclean_penalty_rushing_plays:
  list_unclean_penalty_special_plays.pop(list_unclean_penalty_special_plays.index(i))

for i in list_unclean_penalty_false_start_plays:
  list_unclean_penalty_special_plays.pop(list_unclean_penalty_special_plays.index(i))

for i in list_unclean_penalty_sacked_plays:
  list_unclean_penalty_special_plays.pop(list_unclean_penalty_special_plays.index(i))

for i in list_unclean_penalty_2pt_plays:
  list_unclean_penalty_special_plays.pop(list_unclean_penalty_special_plays.index(i))

dict_unclean_to_clean_penalty_special_plays = unclean_clean_matches(week1_2023_plays_modified.iloc[list_unclean_penalty_special_plays], df_week1_plays_cleaned)

# Number of rushing penalty plays
print(f"{len(list_unclean_penalty_special_plays)} number of passing penalty plays")
print("\n\n")
for i in dict_unclean_to_clean_penalty_special_plays.keys():
  print(f"({i}, {dict_unclean_to_clean_penalty_special_plays.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

### Turnover On Downs

In [None]:
# All turnover on downs

df_unclean_turnover_on_downs_week1 = week1_2023_plays_modified[week1_2023_plays_modified['PlayOutcome'].str.contains('Turnover on Downs')]

dict_unclean_to_clean_turnover_on_downs = unclean_clean_matches(df_unclean_turnover_on_downs_week1, df_week1_plays_cleaned)

print(f"{len(dict_unclean_to_clean_turnover_on_downs)} number of field goal plays")
print("\n\n")
for i in dict_unclean_to_clean_turnover_on_downs.keys():
  print(f"({i}, {dict_unclean_to_clean_turnover_on_downs.get(i)})")
  play = week1_2023_plays_modified['PlayDescription'].iloc[i]
  play_split = play.split(". ")
  for j in play_split:
    print(j)
  print()

## Index searching

In [None]:
week1_2023_plays_modified.iloc[0]

In [61]:
# df_week1_plays_cleaned.iloc[379]
df_week1_plays_cleaned.iloc[833]
# df_week1_plays_cleaned['PlayDescription'].iloc[832]

Unnamed: 0,833
Season,2023
Week,Week 1
Day,SUN
Date,09/10
AwayTeam,Cowboys
HomeTeam,Giants
Quarter,1
DriveNumber,1
TeamWithPossession,NYG
IsScoringDrive,1


# CLEANED DATASET OBSERVATIONS
- Attempting to grab basic stats on players for a single game

## Helper Methods

In [None]:
# Get rid of duplicate rows (for play's that have multiple rows)

def no_duplicates(df_with_duplicates, index_start=None):

  # exit case
  # - The last element has been grabbed
  if df_with_duplicates.tail(1).index[0] == index_start:
    return df_with_duplicates

  if index_start == None:
    index_start = df_with_duplicates.index[0]

  first_element = df_with_duplicates.loc[index_start]

  second_element = df_with_duplicates.iloc[df_with_duplicates.index.tolist().index(index_start)+1]

  # Features that will decipher whether the two rows are apart of the same play
  matching_features = ['Season', 'Week', 'Date', 'AwayTeam', 'HomeTeam', 'Quarter', 'DriveNumber', 'TeamWithPossession', 'PlayNumberInDrive']

  # - Check to see if 1st and 2nd elements are match
  if first_element[matching_features].equals(second_element[matching_features]):
    # 1. remove 2nd element
    df_with_duplicates = df_with_duplicates.drop(df_with_duplicates.index[df_with_duplicates.index.tolist().index(index_start)+1], inplace=False)
    # 2. run method starting search from 1st element
    #    - This is in case more matches to 1st element
    return no_duplicates(df_with_duplicates, index_start)
  else:
    # 1. run method starting search from 2nd element
    #    - 2nd element will become '1st element'
    #    - after 2nd element will become '2nd element'
    return no_duplicates(df_with_duplicates, df_with_duplicates.index[df_with_duplicates.index.tolist().index(index_start)+1])

**Table Creation**
- Goal is to mirror tables on NFL.com for testing purposes

In [None]:
# Display the scoring table for a specified game.

def score_table(away_team, home_team, df_cleaned_plays, dict_of_teams):
  df_all_plays_in_game = df_cleaned_plays.loc[(df_cleaned_plays['HomeTeam'] == home_team) &
                                              (df_cleaned_plays['AwayTeam'] == away_team)]

  df_all_plays_in_game = no_duplicates(df_all_plays_in_game)

  teams = [[away_team],[home_team]]

  for i in range(len(teams)):
    total_score = 0
    for quarter in df_all_plays_in_game['Quarter'].unique():
      quarter_score = 0

      # touchdowns
      df_touchdowns_in_quarter = df_all_plays_in_game.loc[(df_all_plays_in_game['PlayOutcome'].str.contains(f'touchdown {teams[i]}', case=False)) &
                                                          (df_all_plays_in_game['Quarter'] == quarter)]
      quarter_score += df_touchdowns_in_quarter.shape[0] * 6

      # PAT
      df_extra_points_in_quarter = df_all_plays_in_game.loc[(df_all_plays_in_game['PlayOutcome'].str.contains('Extra Point Good', case=False)) &
                                                            (df_all_plays_in_game['Quarter'] == quarter) &
                                                            (df_all_plays_in_game['TeamWithPossession'] == dict_of_teams.get(teams[i][0]))]
      quarter_score += df_extra_points_in_quarter.shape[0] * 1

      # field goals
      df_field_goals_in_quarter = df_all_plays_in_game.loc[(df_all_plays_in_game['PlayOutcome'].str.contains('Field Goal Good', case = False)) &
                                                          (df_all_plays_in_game['Quarter'] == quarter) &
                                                          (df_all_plays_in_game['TeamWithPossession'] == dict_of_teams.get(teams[i][0]))]
      quarter_score += df_field_goals_in_quarter.shape[0] * 3

      teams[i].append(quarter_score)
      total_score += quarter_score

    teams[i].append(total_score)
    teams[i].pop(0)

  scoring_columns = df_all_plays_in_game['Quarter'].unique().tolist()

  scoring_columns.append("Total")

  return pd.DataFrame(teams, columns = scoring_columns, index=[dict_of_teams.get(away_team), dict_of_teams.get(home_team)])

In [None]:
# Display Quarterback stats for a specified game

def quarterback_table(away_team, home_team, df_cleaned_plays, dict_acronym_to_team):

  # All plays within game
  df_all_plays_in_game = df_cleaned_plays.loc[(df_cleaned_plays['HomeTeam'] == home_team) &
                                              (df_cleaned_plays['AwayTeam'] == away_team)]

  # list of quarterbacks in game
  list_qbs = df_all_plays_in_game['Passer'].unique().tolist()
  if 'nan' in list_qbs:
    list_qbs.pop(list_qbs.index('nan'))

  #   key: quarterback
  # value: team
  dict_qbs_to_team = {}
  for qb in list_qbs:
    dict_qbs_to_team[qb] = dict_acronym_to_team.get(df_all_plays_in_game['TeamWithPossession'].loc[df_all_plays_in_game['Passer'] == qb].value_counts().index[0])

  df_quarterback_data = pd.DataFrame(columns=["CP/ATT", "YDS", "TD", "INT"], index = list(dict_qbs_to_team.keys()))

  # Grabbing data for each quarterback in game
  for qb in df_quarterback_data.index:

    passing_attempts = df_all_plays_in_game.loc[(df_all_plays_in_game['Passer'] == qb) &
                                                (df_all_plays_in_game['PlayType'] == "Pass")]

    passing_completions = passing_attempts.loc[(passing_attempts['PlayOutcome'].str.contains('yard pass', case=False)) |
                                               (passing_attempts['PlayOutcome'].str.contains(f'touchdown {dict_qbs_to_team.get(qb)}', case=False)) |
                                               (passing_attempts['PlayOutcome'].str.contains("Turnover On Downs \(C\)", case=False)) |
                                               (passing_attempts['PlayOutcome'].str.contains("Pass for No Gain", case=False)) |
                                               (passing_attempts['PlayOutcome'].str.contains("Fumble \(C\)", case=False))]

    df_quarterback_data.loc[qb, 'CP/ATT'] = f"{passing_completions.shape[0]}/{passing_attempts.shape[0]}"

    df_quarterback_data.loc[qb, 'YDS'] = int(passing_completions['Yardage'].sum())

    total_touchdowns = passing_completions.loc[passing_completions['PlayOutcome'].str.contains(f'touchdown {dict_qbs_to_team.get(qb)}', case=False)]

    df_quarterback_data.loc[qb, 'TD'] = total_touchdowns.shape[0]

    total_interceptions = passing_attempts.loc[passing_attempts['PlayDescription'].str.contains('intercepted', case=False)]

    df_quarterback_data.loc[qb, 'INT'] = total_interceptions.shape[0]

  return df_quarterback_data

## Home and Away teams (Week 1, 2023)

In [None]:
# Season 2023 Week 1 schedule

df_2023_week1_schedule = df_week1_plays_cleaned[['HomeTeam', 'AwayTeam', 'Season', 'Date', 'Day']].drop_duplicates().sort_values(by='Date').reset_index(drop=True)

df_2023_week1_schedule

In [None]:
dict_teams = {
    'Cardinals': 'ARI', 'Falcons': 'ATL', 'Ravens': 'BAL', 'Bills': 'BUF', 'Panthers': 'CAR', 'Bears': 'CHI',
    'Bengals': 'CIN', 'Browns': 'CLE', 'Cowboys': 'DAL', 'Broncos': 'DEN', 'Lions': 'DET', 'Packers': 'GB',
    'Texans': 'HOU', 'Colts': 'IND', 'Jaguars': 'JAX', 'Chiefs': 'KC', 'Raiders': 'LV', 'Chargers': 'LAC',
    'Rams': 'LAR', 'Dolphins': 'MIA', 'Vikings': 'MIN', 'Patriots': 'NE', 'Saints': 'NO', 'Giants': 'NYG',
    'Jets': 'NYJ', 'Eagles': 'PHI', 'Steelers': 'PIT', '49ers': 'SF', 'Seahawks': 'SEA', 'Buccaneers': 'TB',
    'Titans': 'TEN', 'Commanders': 'WAS'
}

In [None]:
dict_teams_2 = {
    'ARI': 'Cardinals', 'ATL': 'Falcons', 'BAL': 'Ravens', 'BUF': 'Bills', 'CAR': 'Panthers', 'CHI': 'Bears',
    'CIN': 'Bengals', 'CLE': 'Browns', 'DAL': 'Cowboys', 'DEN': 'Broncos', 'DET': 'Lions', 'GB': 'Packers',
    'HOU': 'Texans', 'IND': 'Colts', 'JAX': 'Jaguars', 'KC': 'Chiefs', 'LV': 'Raiders', 'LAC': 'Chargers',
    'LAR': 'Rams', 'MIA': 'Dolphins', 'MIN': 'Vikings', 'NE': 'Patriots', 'NO': 'Saints', 'NYG': 'Giants',
    'NYJ': 'Jets', 'PHI': 'Eagles', 'PIT': 'Steelers', 'SF': '49ers', 'SEA': 'Seahawks', 'TB': 'Buccaneers',
    'TEN': 'Titans', 'WAS': 'Commanders'
}

## Scoring Table
COLUMNS:
- Each quarter of the game
ROW:
- Each team playing in game

In [None]:
# Some games may not have every play recorded.
# (Week 1 2023, Game 1, 3rd quarter)
# - A field goal was supposed to be recorded after the interception touchdown but
#   was not.

game_num = 10

away_team = df_2023_week1_schedule['AwayTeam'].iloc[game_num]
home_team = df_2023_week1_schedule['HomeTeam'].iloc[game_num]

score_table(away_team, home_team, df_week1_plays_cleaned, dict_teams)

## Passing Table

INDEX:
- Each quarterback that played in game
COLUMNS:
- CP/ATT - completions / pass attempts
- YDS - total passing yards
- TD - total touchdowns thrown
- INT - total interceptions thrown

In [None]:
game_num = 10

away_team = df_2023_week1_schedule['AwayTeam'].iloc[game_num]
home_team = df_2023_week1_schedule['HomeTeam'].iloc[game_num]

quarterback_table(away_team, home_team, df_week1_plays_cleaned, dict_teams_2)

## Rushing Table

In [None]:
game_num = 1

away_team = df_2023_week1_schedule['AwayTeam'].iloc[game_num]
home_team = df_2023_week1_schedule['HomeTeam'].iloc[game_num]

df_all_plays_in_game = df_week1_plays_cleaned.loc[(df_week1_plays_cleaned['HomeTeam'] == home_team) &
                                                  (df_week1_plays_cleaned['AwayTeam'] == away_team)]

df_home_rushing_plays = df_all_plays_in_game.loc[(df_all_plays_in_game['PlayType'].str.contains('run', case=False)) &
                                                 (df_all_plays_in_game['TeamWithPossession'] == dict_teams.get(home_team))]

df_away_rushing_plays = df_all_plays_in_game.loc[(df_all_plays_in_game['PlayType'].str.contains('run', case=False)) &
                                                 (df_all_plays_in_game['TeamWithPossession'] == dict_teams.get(away_team))]

home_team_rushers = df_home_rushing_plays['Rusher'].unique().tolist()

if 'nan' in home_team_rushers:
  home_team_rushers.pop(home_team_rushers.index('nan'))

away_team_rushers = df_away_rushing_plays['Rusher'].unique().tolist()

if 'nan' in away_team_rushers:
  away_team_rushers.pop(away_team_rushers.index('nan'))

In [None]:
# home

df_rusher_data = pd.DataFrame(columns=["CAR", "YDS", "TD", "AVG"], index = home_team_rushers)

for rb in df_rusher_data.index:
  rusher_plays = df_home_rushing_plays.loc[df_home_rushing_plays['Rusher'] == rb]
  df_rusher_data.loc[rb, 'CAR'] = rusher_plays.shape[0]
  df_rusher_data.loc[rb, 'YDS'] = int(rusher_plays['Yardage'].sum())
  df_rusher_data.loc[rb, 'TD'] = rusher_plays.loc[rusher_plays['PlayOutcome'].str.contains('touchdown', case=False)].shape[0]
  df_rusher_data.loc[rb, 'AVG'] = round(rusher_plays['Yardage'].mean(), 2)

df_rusher_data.sort_values(by="YDS", ascending=False, inplace=True)
df_rusher_data

In [None]:
# away

df_rusher_data = pd.DataFrame(columns=["CAR", "YDS", "TD", "AVG"], index = away_team_rushers)

for rb in df_rusher_data.index:
  rusher_plays = df_away_rushing_plays.loc[df_away_rushing_plays['Rusher'] == rb]
  df_rusher_data.loc[rb, 'CAR'] = rusher_plays.shape[0]
  df_rusher_data.loc[rb, 'YDS'] = int(rusher_plays['Yardage'].sum())
  df_rusher_data.loc[rb, 'TD'] = rusher_plays.loc[rusher_plays['PlayOutcome'].str.contains('touchdown', case=False)].shape[0]
  df_rusher_data.loc[rb, 'AVG'] = round(rusher_plays['Yardage'].mean(), 2)

df_rusher_data.sort_values(by="YDS", ascending=False, inplace=True)
df_rusher_data

In [None]:
rusher_plays = df_away_rushing_plays.loc[df_away_rushing_plays['Rusher'].str.contains('A.Dillon')]

for idx, play in rusher_plays['PlayDescription'].items():
  print(idx)
  play_split = play.split(". ")
  for i in play_split:
    print(i)
  print()

##Receiving Table

In [None]:
# Check game 0. Need to work on offensive and defensive penalties.

game_num = 1

away_team = df_2023_week1_schedule['AwayTeam'].iloc[game_num]
home_team = df_2023_week1_schedule['HomeTeam'].iloc[game_num]

df_all_plays_in_game = df_week1_plays_cleaned.loc[(df_week1_plays_cleaned['HomeTeam'] == home_team) &
                                                  (df_week1_plays_cleaned['AwayTeam'] == away_team)]

df_home_receiving_plays = df_all_plays_in_game.loc[(df_all_plays_in_game['PlayType'].str.contains('pass', case=False)) &
                                                 (df_all_plays_in_game['TeamWithPossession'] == dict_teams.get(home_team))]

df_away_receiving_plays = df_all_plays_in_game.loc[(df_all_plays_in_game['PlayType'].str.contains('pass', case=False)) &
                                                 (df_all_plays_in_game['TeamWithPossession'] == dict_teams.get(away_team))]

home_team_receivers = df_home_receiving_plays['Receiver'].unique().tolist()

if 'nan' in home_team_receivers:
  home_team_receivers.pop(home_team_receivers.index('nan'))

away_team_receivers = df_away_receiving_plays['Receiver'].unique().tolist()

if 'nan' in away_team_receivers:
  away_team_receivers.pop(away_team_receivers.index('nan'))

In [None]:
# home

df_receiver_data = pd.DataFrame(columns=["REC", "YDS", "TD", "TGTS"], index = home_team_receivers)

for receiver in df_receiver_data.index:
  receiver_plays = df_home_receiving_plays.loc[df_home_receiving_plays['Receiver'] == receiver]
  df_receiver_data.loc[receiver, 'REC'] = receiver_plays.loc[(receiver_plays['PlayOutcome'].str.contains('yard pass', case=False)) |
                                                             (receiver_plays['PlayOutcome'].str.contains(f'touchdown', case=False)) |
                                                             (receiver_plays['PlayOutcome'].str.contains("Turnover On Downs \(C\)", case=False)) |
                                                             (receiver_plays['PlayOutcome'].str.contains("Pass for No Gain", case=False)) |
                                                             (receiver_plays['PlayOutcome'].str.contains("Fumble \(C\)", case=False))].shape[0]
  df_receiver_data.loc[receiver, 'YDS'] = int(receiver_plays['Yardage'].sum())
  df_receiver_data.loc[receiver, 'TD'] = receiver_plays.loc[receiver_plays['PlayOutcome'].str.contains(f'touchdown {home_team}', case=False)].shape[0]
  df_receiver_data.loc[receiver, 'TGTS'] = receiver_plays.shape[0]

df_receiver_data.sort_values(by="YDS", ascending=False, inplace=True)
df_receiver_data

In [None]:
# away

df_receiver_data = pd.DataFrame(columns=["REC", "YDS", "TD", "TGTS"], index = away_team_receivers)

for receiver in df_receiver_data.index:
  receiver_plays = df_away_receiving_plays.loc[df_away_receiving_plays['Receiver'] == receiver]
  df_receiver_data.loc[receiver, 'REC'] = receiver_plays.loc[(receiver_plays['PlayOutcome'].str.contains('yard pass', case=False)) |
                                                             (receiver_plays['PlayOutcome'].str.contains(f'touchdown', case=False)) |
                                                             (receiver_plays['PlayOutcome'].str.contains("Turnover On Downs \(C\)", case=False)) |
                                                             (receiver_plays['PlayOutcome'].str.contains("Pass for No Gain", case=False)) |
                                                             (receiver_plays['PlayOutcome'].str.contains("Fumble \(C\)", case=False))].shape[0]
  df_receiver_data.loc[receiver, 'YDS'] = int(receiver_plays['Yardage'].sum())
  df_receiver_data.loc[receiver, 'TD'] = receiver_plays.loc[receiver_plays['PlayOutcome'].str.contains(f'touchdown {away_team}', case=False)].shape[0]
  df_receiver_data.loc[receiver, 'TGTS'] = receiver_plays.shape[0]

df_receiver_data.sort_values(by="YDS", ascending=False, inplace=True)
df_receiver_data