<a href="https://colab.research.google.com/github/Keoni808/NFL_Data_Cleaning/blob/main/NFL_Plays_Week1_2023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

PURPOSE:
- To view a larger sample of plays.
  - I am currently breaking down a single game, but that one game does not contain enough data to correctly break down the play descriptions for every play type.

# MOUNTING AND IMPORTS

In [1]:
# Mount your Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Used to access personal google cloud services
from google.colab import auth
auth.authenticate_user()
print('Authenticated')

Authenticated


In [3]:
# Installs
!pip install ipdb

Collecting ipdb
  Downloading ipdb-0.13.13-py3-none-any.whl.metadata (14 kB)
Collecting jedi>=0.16 (from ipython>=7.31.1->ipdb)
  Using cached jedi-0.19.1-py2.py3-none-any.whl.metadata (22 kB)
Downloading ipdb-0.13.13-py3-none-any.whl (12 kB)
Using cached jedi-0.19.1-py2.py3-none-any.whl (1.6 MB)
Installing collected packages: jedi, ipdb
Successfully installed ipdb-0.13.13 jedi-0.19.1


In [4]:
# Imports

# Data manipulation
import pandas as pd

# Regular expressions
import re

# Grab data from database
from google.cloud import bigquery

# Debugging
import ipdb

In [5]:
# Turning on automatic debugger
%pdb on

Automatic pdb calling has been turned ON


# LOADING DATA (BigQuery queries)

In [6]:
# Client connect to bigquery project
client = bigquery.Client('nfl-data-430702')

## Season 2023 Week 1

In [7]:
# Grabbing all plays from Week 1 of the 2023 season
week1_2023_plays_query = """
                         SELECT *
                         FROM `nfl-data-430702.NFL_Scores.NFL-Plays-Week1_2023`
                         """

# Running a dry-run (pseudo) query, which returns the number of bytes the real query would process
dry_run_config = bigquery.QueryJobConfig(dry_run=True)
dry_run_query = client.query(week1_2023_plays_query, job_config=dry_run_config)
print("This query will process {} bytes.".format(dry_run_query.total_bytes_processed))

# Running query (being mindful of the amount of data being grabbed)
# The job fails instead of running if it would bill more than 1 GB (10**9 bytes)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
safe_config_query = client.query(week1_2023_plays_query, job_config=safe_config)

This query will process 570194 bytes.
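
The dry-run estimate and the byte cap can be compared before the real query is submitted. The following is a minimal sketch in plain Python. `within_budget` and its parameter names are hypothetical helpers introduced here for illustration, not part of the BigQuery client.

```python
def within_budget(estimated_bytes, max_bytes=10**9):
    """Return True when a dry-run estimate fits under the byte cap.

    Hypothetical helper: estimated_bytes would come from
    dry_run_query.total_bytes_processed, and max_bytes mirrors the
    maximum_bytes_billed value passed to the real query.
    """
    return estimated_bytes <= max_bytes


estimate = 570194  # value printed by the dry run in this notebook
print(within_budget(estimate))
print(f"{estimate / 10**6:.2f} MB")
```

With a check like this, the call that uses `safe_config` could be skipped entirely when the estimate is already over the cap.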


In [8]:
# Putting data obtained from query into a dataframe
week1_2023_plays = safe_config_query.to_dataframe()

In [9]:
week1_2023_plays.head()

Unnamed: 0,Season,Week,Day,Date,AwayTeam,HomeTeam,Quarter,DriveNumber,TeamWithPossession,IsScoringDrive,PlayNumberInDrive,IsScoringPlay,PlayOutcome,PlayDescription,PlayStart
0,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,1,0,Kickoff,G.Zuerlein kicks 65 yards from NYJ 35 to end z...,Kickoff from NYJ 35
1,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,2,0,7 Yard Pass,(15:00) (Shotgun) J.Allen pass short right to ...,1st & 10 at BUF 25
2,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,3,0,5 Yard Pass,"(14:34) (No Huddle, Shotgun) J.Allen pass shor...",2nd & 3 at BUF 32
3,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,4,0,3 Yard Run,(14:01) J.Cook up the middle to BUF 40 for 3 y...,1st & 10 at BUF 37
4,2023,Week 1,MON,09/11,Bills,Jets,1ST QUARTER,1,BUF,0,5,0,2 Yard Run,(13:24) (Shotgun) J.Cook up the middle to BUF ...,2nd & 7 at BUF 40


In [10]:
# Observation of the amount of data being worked on
week1_2023_plays.shape

(2600, 15)

# CATEGORIZE PLAYS
- The goal here is to parse out the different values for 'PlayOutcome'
  - separate pass / run / kickoff / etc.
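
Many 'PlayOutcome' values follow the shape `<yards> Yard <type>` (for example '7 Yard Pass' or '-9 Yard Sack'), so the yardage and the play type can be pulled apart with one regular expression. This is a sketch under that assumption. `YARDAGE_OUTCOME` and `parse_outcome` are names made up for this example.

```python
import re

# Matches outcomes such as '7 Yard Pass' or '-9 Yard Sack'.
YARDAGE_OUTCOME = re.compile(r"^(?P<yards>-?\d+) Yard (?P<kind>.+)$")


def parse_outcome(outcome):
    """Split a 'PlayOutcome' string into (yards, kind).

    yards is None when the outcome carries no leading yardage,
    e.g. 'Punt' or 'Run for No Gain'.
    """
    match = YARDAGE_OUTCOME.match(outcome)
    if match:
        return int(match.group("yards")), match.group("kind")
    return None, outcome


for outcome in ["7 Yard Pass", "-9 Yard Sack", "Pass Incomplete", "Touchdown Bills"]:
    print(outcome, "->", parse_outcome(outcome))
```

Outcomes without a leading yardage are returned unchanged, so they can still be routed by keyword afterwards.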

## PARSING


In [11]:
# Maybe try to fuzzywuzzy this in the future?

# All play outcomes from Week 1
# - From here we can categorize and clean plays accordingly
week1_2023_plays['PlayOutcome'].unique()

array(['Kickoff', '7 Yard Pass', '5 Yard Pass', '3 Yard Run',
       '2 Yard Run', 'Pass Incomplete', 'Punt', '-5 Yard Penalty',
       '5 Yard Run', '1 Yard Pass', '14 Yard Run', '3 Yard Pass',
       '8 Yard Run', '6 Yard Pass', '15 Yard Pass', '-9 Yard Sack',
       '4 Yard Pass', '13 Yard Pass', 'Field Goal', '-2 Yard Sack',
       'Interception', '-5 Yard Run', '18 Yard Pass', '8 Yard Pass',
       '6 Yard Run', '12 Yard Run', '-1 Yard Run', '26 Yard Pass',
       'Touchdown Bills', 'Extra Point Good', '13 Yard Run',
       '-3 Yard Sack', '7 Yard Run', '9 Yard Pass', '4 Yard Run',
       'Fumble', '-10 Yard Penalty', '10 Yard Pass', '26 Yard Run',
       '5 Yard Penalty', '-10 Yard Sack', '22 Yard Pass', '-4 Yard Run',
       '-12 Yard Sack', '83 Yard Run', '1 Yard Run', '2 Yard Pass',
       '10 Yard Run', 'Run for No Gain', '12 Yard Pass', '20 Yard Pass',
       '9 Yard Run', '-2 Yard Pass', 'Sack', '24 Yard Pass',
       '14 Yard Pass', 'Touchdown Jets', '-3 Yard Run', '-2 Yar

In [12]:
# There are more play types for Week 1 that I have not created filters for yet.

# Looking at all unique play outcomes and categorizing them.
# - This approach does not feel very flexible, because a play outcome can
#   arise that has not been seen yet.
# - More outcomes may appear when working on a full season, let alone all seasons and future games.
df_2023_pass_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Pass')]
df_2023_run_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Run')]

# df_2023_punt_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Punt')]
# df_2023_sack_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Sack')]
# df_2023_kickoff_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Kickoff')]
# df_2023_fumble_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Fumble')]
# df_2023_interception_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Interception')]
# df_2023_penalty_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Penalty')]
# df_2023_fieldgoal_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Field Goal')]
# df_2023_touchdown_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Touchdown')]
# df_2023_extrapoint_week1 = week1_2023_plays[week1_2023_plays['PlayOutcome'].str.contains('Extra Point')]

# plays_list = [df_2023_pass_week1,
#               df_2023_run_week1,
#               df_2023_punt_week1,
#               df_2023_sack_week1,
#               df_2023_kickoff_week1,
#               df_2023_fumble_week1,
#               df_2023_interception_week1,
#               df_2023_penalty_week1,
#               df_2023_fieldgoal_week1,
#               df_2023_touchdown_week1,
#               df_2023_extrapoint_week1]

## SANITY CHECK (All Plays Accounted for)
- NOT COMPLETE
  - Still need to grab other play types

## HELPER METHODS

In [13]:
# PURPOSE:
# - Quick look at a section of plays
#   - Ideally the plays that the user wants to break down and clean.
# INPUT PARAMETERS:
# df_all_plays      - DataFrame - The original dataframe where the desired plays to view came from
# df_section_plays  - DataFrame - A section of the original dataframe the user wants to view
# RETURN:
# - Printing to the console:
#   1. index of play
#   2. 'PlayDescription' feature of play
#   3. 'PlayOutcome' feature of play
def print_plays(df_all_plays, df_section_plays):
  for idx, value in df_section_plays['PlayOutcome'].items():
    print("index:" + str(idx))
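    # idx is an index label from df_section_plays, so look it up by label (.loc), not by position (.iloc)
    play = df_all_plays['PlayDescription'].loc[idx]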
    print(play)
    print(value)
    print()

# FUMBLED PLAYS (Pass & Run)
- Only looking for fumbled plays

In [14]:
# 1.1 ATTEMPT TO HANDLE FUMBLES.

# - Fumbled plays come in many different formats, but I believe I have found
#   a way to break them down, or at least a direction to go in.
# - Plan of action:
#   1. Split fumble play description by sentences.
#      - No matter what format a fumble play is in, each moment is its own
#        sentence (e.g. "(1)QB throws to Receiver. (2)Receiver fumbled by Defender.")
#      - Split a fumble play description by sentences and break down each patterned sentence
#   2. Create 2 lists
#       1 - Contains every sentence within play description.
#           - Within this list I want to keep all sentences in their original order.
#       2 - Contains the main play sentence.
#           - The main play sentences are formatted exactly like a regular run or pass play
#             that I have already broken down.
#               - This is why the fumble check belongs at the start of the pass/run
#                 breakdown: if I can single out the main play of the fumble,
#                 I can run it through the existing run or pass breakdown
#                 and put everything else into the feature 'fumble_information'.
#                 - That said, I will still grab bits of information from the other
#                   sentences within the fumble play and place them into their respective
#                   features.
#   3. Grab wanted data from each sentence and place into features.
#   4. Send main play into pass/run breakdown.

# 'fumble_information' is created with the same length as the list of sentences grabbed
# from the fumble play, so that each sentence keeps the position it originally had.

name_pattern = r'\b[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\b' # Grabs all names but will only be used for Passer


for idx, value in df_2023_pass_week1['PlayOutcome'].items():
  play = week1_2023_plays['PlayDescription'].iloc[idx]


  if play.find('FUMBLES') != -1:
    # 1. Splitting the play description by sentences.
    fumble_play_elements = play.split(". ")
    # 2. Creating 2 lists
    # Making 'fumble_information' the same size to keep non main play sentences
    # in their original order. (Take out 'None' at original index and replace with sentence)
    fumble_information = [None] * len(fumble_play_elements)

    # Doing this in case there are multiple fumbles to handle in a single play.
    # while(play.find('FUMBLES') != -1):

    # cycle through all sentences

    # I need to figure out some way to loop through every sentence within a list
    # and also be able to subtract from that list?
    # I have an issue where there are multiple fumbles within a single play.
    # I would like to get this to a point where I can

    for i in fumble_play_elements:
      # ~ Fumble somehow only involving the quarterback alone. ~
      # - There will only be a single player within this sentence.
      # Data to grab:
      # ?
      passer = re.findall(name_pattern, i)
      if len(passer) == 1:
        fumble_information.pop(fumble_play_elements.index(i))
        fumble_information.insert(fumble_play_elements.index(i), i)
      # ~ Fumble and recovery ~
      # - So far what I see here is that in this sentence there
      #   is (1) the defender who forced the fumble. And there
      #   is (2) who recovered the fumble.
      #   - Something notable is that if the opposing team
      #     recovers the fumble it is displayed as 'RECOVERED'.
      #     If it is recovered by the driving team, it is
      #     displayed as 'recovered'.
      # Data to grab:
      # 1. Who forced fumble?
      # 2. Who recovered?
      #     - Since this is starting to take a lot of space, I may
      #       not include these in the final dataframe.
      if i.find('FUMBLES') != -1:
        fumble_information.pop(fumble_play_elements.index(i))
        fumble_information.insert(fumble_play_elements.index(i), i)
      # ~ Play has been reversed ~
      # - The original play has been overturned and the driving team gets
      #   another shot at a play.
      #   - This is important because it means the original play
      #     no longer counts, so I will need to somehow grab only the
      #     second "main" play to break down.
      if i.find('REVERSED') != -1:
        fumble_information.pop(fumble_play_elements.index(i))
        fumble_information.insert(fumble_play_elements.index(i), i)
        play_before_reverse = fumble_play_elements[0]
        fumble_information.pop(0)
        fumble_information.insert(0, play_before_reverse)
        print(fumble_play_elements.index(i))
        print(fumble_play_elements)
        fumble_play_elements = fumble_play_elements[fumble_play_elements.index(i)+1::]
        i = 0
        print()
        print(fumble_play_elements)
        # I wonder if you can set 'fumble_play_elements' from this point forth
        # and reset i to equal 0?
      # ~ Official ruling ~
      # Maybe will address outside of this.
      # - Official ruling will state yardage gained from play.
      #   - But I do not think that a sentence like that will appear only in
      #     fumble play descriptions.

      # I need to make sure that the sentence that goes into the main breakdown
      # includes everything that is needed. (e.g. injuries)
      # - What happens if there is another fumble after this?

    main_plays = [item for item in fumble_play_elements if item not in fumble_information]
    # print(main_plays)
    print()
  # play = ". ".join(main_plays)
  # print(fumble_information)
  # print(play)
  # print(value)
  # print()


  # I need to grab information from a fumble
  # if there is another fumble I need to grab that information
  # if there is another I need to grab that one
  # and so on and so on.


3
['(14:15) T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)', 'FUMBLES (E.Speed), RECOVERED by IND-E.Speed at IND 49', 'E.Speed ran ob at IND 49 for no gain', 'The Replay Official reviewed the ball was inbounds ruling, and the play was REVERSED', 'T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)', 'FUMBLES (E.Speed), ball out of bounds at IND 49', 'IND-K.Moore was injured during the play', 'IND-D.Flowers was injured during the play.']

['T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)', 'FUMBLES (E.Speed), ball out of bounds at IND 49', 'IND-K.Moore was injured during the play', 'IND-D.Flowers was injured during the play.']



2
['(6:32) (Shotgun) B.Mayfield pass short middle to C.Otton to MIN 47 for 14 yards (A.Evans)', 'FUMBLES (A.Evans), recovered by TB-T.Palmer at MIN 43', 'Minnesota challenged the pass completion ruling, and the play was REVERSED', '(Shotgun) B.Mayfield pass in

In [40]:
# Most promising iteration so far

name_pattern = r'(?<!-)[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]* $' # Only matches a name (not preceded by '-') that is followed by a space at the very end of a sentence

for idx, value in df_2023_pass_week1['PlayOutcome'].items():
  play = week1_2023_plays['PlayDescription'].iloc[idx]

  if play.find('FUMBLES') != -1:
    fumble_play_elements = play.split(". ")
    fumble_details = [None] * len(fumble_play_elements)
    main_play = []

    for i in fumble_play_elements:
      main_play.append(i)

      passer = re.findall(name_pattern, i)
      if len(passer) == 1:
        # Take out of main_play (will be at the end of the list)
        main_play.pop(main_play.index(i))
        # This will be added to fumble_details in the same index as the original.
        fumble_details.pop(fumble_play_elements.index(i))
        fumble_details.insert(fumble_play_elements.index(i), i)

      if i.find('FUMBLES') != -1:
        # This will not be added to main play.
        main_play.pop(len(main_play) - 1)
        # This will be added to fumble_details in the same index as the original.
        fumble_details.pop(fumble_play_elements.index(i))
        fumble_details.insert(fumble_play_elements.index(i), i)

      if i.find('REVERSED') != -1:
        for j in main_play:
          fumble_details.pop(fumble_play_elements.index(j))
          fumble_details.insert(fumble_play_elements.index(j), j)
        main_play.clear()

    print(fumble_details)
    print(main_play)
    print(value)
    print()

[None, 'FUMBLES, and recovers at CHI 46', None]
['(14:21) J.Love to CHI 44 for -3 yards', 'J.Love pass deep left to L.Musgrave to CHI 4 for 37 yards (T.Stevenson) [D.Walker].']
37 Yard Pass

['(14:15) T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)', 'FUMBLES (E.Speed), RECOVERED by IND-E.Speed at IND 49', 'E.Speed ran ob at IND 49 for no gain', 'The Replay Official reviewed the ball was inbounds ruling, and the play was REVERSED', None, 'FUMBLES (E.Speed), ball out of bounds at IND 49', None, None]
['T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)', 'IND-K.Moore was injured during the play', 'IND-D.Flowers was injured during the play.']
14 Yard Pass

[None, 'FUMBLES (B.Okereke), recovered by DAL-T.Biadasz at NYG 4.']
['(11:26) (Shotgun) D.Prescott pass short right to T.Pollard to NYG 12 for 7 yards (B.Okereke)']
7 Yard Pass

[None, 'FUMBLES (M.Bell), recovered by NYG-P.Campbell at NYG 35', None, None]
['(4:45) (Shot
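
The split into `main_play` and `fumble_details` can also be written without `pop`/`insert`, by marking each sentence position as "detail" or "main". This is a simplified sketch, not the cell above: it only applies the 'FUMBLES' and 'REVERSED' rules and leaves out the single-name check. `split_fumble_play` is a hypothetical name, and the sample description is rebuilt by joining the T.Lawrence sentences from the output with ". ".

```python
def split_fumble_play(description):
    """Return (main_play, fumble_details) for one play description.

    fumble_details has one slot per sentence (None where the sentence
    stayed in main_play), so original positions are preserved.
    """
    sentences = description.split(". ")
    is_detail = [False] * len(sentences)

    for position, sentence in enumerate(sentences):
        if "FUMBLES" in sentence:
            is_detail[position] = True
        if "REVERSED" in sentence:
            # Everything up to and including the reversal no longer
            # counts as the main play.
            for earlier in range(position + 1):
                is_detail[earlier] = True

    main_play = [s for s, d in zip(sentences, is_detail) if not d]
    fumble_details = [s if d else None for s, d in zip(sentences, is_detail)]
    return main_play, fumble_details


play = (
    "(14:15) T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed). "
    "FUMBLES (E.Speed), RECOVERED by IND-E.Speed at IND 49. "
    "E.Speed ran ob at IND 49 for no gain. "
    "The Replay Official reviewed the ball was inbounds ruling, and the play was REVERSED. "
    "T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed). "
    "FUMBLES (E.Speed), ball out of bounds at IND 49. "
    "IND-K.Moore was injured during the play. "
    "IND-D.Flowers was injured during the play."
)
main_play, fumble_details = split_fumble_play(play)
print(main_play)
print(fumble_details)
```

Because positions are tracked by index rather than by `list.index(sentence)`, a sentence that appears twice in one description is not confused with its earlier copy.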

In [18]:
name_pattern = r'(?<!-)[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\b' # Grabs all names but will only be used for Passer

# I still want injured players. This is not complete.
# name_pattern = r'[A-Za-z]+\.[A-Za-z]+-?[A-Za-z]*\b' # Grabs all names but will only be used for Passer

for idx, value in df_2023_pass_week1['PlayOutcome'].items():
  play = week1_2023_plays['PlayDescription'].iloc[idx]


  if play.find('FUMBLES') != -1:
    # list of all elements within fumble play description.
    fumble_play_elements = play.split(". ")

    fumble_details = [None] * len(fumble_play_elements)

    # list of elements that will run through run/pass breakdown
    main_play = []

    # Going through all elements
    for i in fumble_play_elements:

      main_play.append(i)

      #!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      # This is the problem.
      # Injured players follow the same pattern as Passers.
      passer = re.findall(name_pattern, i)
      if len(passer) == 1:
        # This will not be added to main play.
        main_play.pop(len(main_play) - 1)
        # This will be added to fumble_details in the same index as the original.
        fumble_details.pop(fumble_play_elements.index(i))
        fumble_details.insert(fumble_play_elements.index(i), i)

      if i.find('FUMBLES') != -1:
        # This will not be added to main play.
        main_play.pop(len(main_play) - 1)
        # This will be added to fumble_details in the same index as the original.
        fumble_details.pop(fumble_play_elements.index(i))
        fumble_details.insert(fumble_play_elements.index(i), i)

      if i.find('REVERSED') != -1:
        # send everything in main play to fumble_details in the same index as original.
        fumble_details.pop(fumble_play_elements.index(i))
        fumble_details.insert(fumble_play_elements.index(i), i)
        for j in main_play:
          # Discard data from main_play
          main_play.pop(main_play.index(j) - 1)
          # Add discarded data from main_play to fumble_details.
          fumble_details.pop(fumble_play_elements.index(j))
          fumble_details.insert(fumble_play_elements.index(j), j)

    for i in fumble_play_elements:
      print(i)
    print(value)
    print("BREAKDOWN")
    print(fumble_details)
    print(main_play)
    print()

(14:21) J.Love to CHI 44 for -3 yards
FUMBLES, and recovers at CHI 46
J.Love pass deep left to L.Musgrave to CHI 4 for 37 yards (T.Stevenson) [D.Walker].
37 Yard Pass
BREAKDOWN
['(14:21) J.Love to CHI 44 for -3 yards', 'FUMBLES, and recovers at CHI 46', None]
['J.Love pass deep left to L.Musgrave to CHI 4 for 37 yards (T.Stevenson) [D.Walker].']

(14:15) T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)
FUMBLES (E.Speed), RECOVERED by IND-E.Speed at IND 49
E.Speed ran ob at IND 49 for no gain
The Replay Official reviewed the ball was inbounds ruling, and the play was REVERSED
T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)
FUMBLES (E.Speed), ball out of bounds at IND 49
IND-K.Moore was injured during the play
IND-D.Flowers was injured during the play.
14 Yard Pass
BREAKDOWN
[None, 'FUMBLES (E.Speed), RECOVERED by IND-E.Speed at IND 49', 'E.Speed ran ob at IND 49 for no gain', 'The Replay Official reviewed the ball was 
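
Since injury sentences collide with the single-name check, one option is to pull them out first with their own pattern. The sketch below assumes every injury sentence has the form `<TEAM>-<Player> was injured during the play`, as in the output above. `INJURY_SENTENCE` and `split_injuries` are hypothetical names.

```python
import re

# Assumed format: '<TEAM>-<Player> was injured during the play', optional trailing '.'.
INJURY_SENTENCE = re.compile(
    r"^(?P<team>[A-Z]{2,3})-(?P<player>\S+) was injured during the play\.?$"
)


def split_injuries(sentences):
    """Return (injuries, remaining).

    injuries  - list of (team, player) tuples from injury sentences
    remaining - all other sentences, in their original order
    """
    injuries, remaining = [], []
    for sentence in sentences:
        match = INJURY_SENTENCE.match(sentence)
        if match:
            injuries.append((match.group("team"), match.group("player")))
        else:
            remaining.append(sentence)
    return injuries, remaining


sentences = [
    "T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)",
    "FUMBLES (E.Speed), ball out of bounds at IND 49",
    "IND-K.Moore was injured during the play",
    "IND-D.Flowers was injured during the play.",
]
injuries, remaining = split_injuries(sentences)
print(injuries)
print(remaining)
```

Running this before the name-count check would keep injured players available as data while removing them from the passer logic.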

In [19]:
for idx, value in df_2023_pass_week1['PlayOutcome'].items():
  play = week1_2023_plays['PlayDescription'].iloc[idx]
  if play.find('FUMBLES') != -1:
    print("index:" + str(idx))
    fumble_play_elements = play.split(". ")
    for i in fumble_play_elements:
      print(i)
    # print(play)
    print(value)
    print()

index:213
(14:21) J.Love to CHI 44 for -3 yards
FUMBLES, and recovers at CHI 46
J.Love pass deep left to L.Musgrave to CHI 4 for 37 yards (T.Stevenson) [D.Walker].
37 Yard Pass

index:423
(14:15) T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)
FUMBLES (E.Speed), RECOVERED by IND-E.Speed at IND 49
E.Speed ran ob at IND 49 for no gain
The Replay Official reviewed the ball was inbounds ruling, and the play was REVERSED
T.Lawrence pass short right to C.Ridley to JAX 47 for 14 yards (R.Thomas, E.Speed)
FUMBLES (E.Speed), ball out of bounds at IND 49
IND-K.Moore was injured during the play
IND-D.Flowers was injured during the play.
14 Yard Pass

index:872
(11:26) (Shotgun) D.Prescott pass short right to T.Pollard to NYG 12 for 7 yards (B.Okereke)
FUMBLES (B.Okereke), recovered by DAL-T.Biadasz at NYG 4.
7 Yard Pass

index:961
(4:45) (Shotgun) D.Jones pass short left to M.Breida to NYG 43 for 5 yards (M.Bell)
FUMBLES (M.Bell), recovered by NYG-P.Campbell at 

In [20]:
for idx, value in df_2023_run_week1['PlayOutcome'].items():
  play = week1_2023_plays['PlayDescription'].iloc[idx]
  if play.find('FUMBLES') != -1:
    print("index:" + str(idx))
    print(play)
    print(value)
    print()

index:115
(9:54) Bre.Hall left end to BUF 22 for -1 yards (G.Rousseau). FUMBLES (G.Rousseau), ball out of bounds at BUF 25.
-4 Yard Run

index:230
(2:08) S.Clifford FUMBLES (Aborted) at CHI 35, and recovers at CHI 35.
Run for No Gain

index:756
(6:44) (Shotgun) J.Goff Aborted. F.Ragnow FUMBLES at KC 24, recovered by DET-J.Goff at KC 27. J.Goff to KC 27 for no gain (G.Karlaftis).
Run for No Gain

index:826
(8:53) (Shotgun) D.Jones Aborted. J.Schmitz FUMBLES at DAL 18, recovered by NYG-D.Jones at DAL 27.
Run for No Gain

index:933
(9:27) (Shotgun) D.Jones FUMBLES (Aborted) at NYG 30, and recovers at NYG 30. D.Jones to NYG 32 for 2 yards (M.Smith).
Run for No Gain

index:1015
(6:33) (No Huddle, Shotgun) L.Jackson scrambles right end to HOU 20 for 6 yards (T.Thomas). FUMBLES (T.Thomas), recovered by BAL-K.Zeitler at HOU 23. HOU-H.Ridgeway was injured during the play.
3 Yard Run

index:1214
(1:39) J.Williams right tackle to TEN 9 for 11 yards (K.Byard, S.Murphy-Bunting). FUMBLES (S.Murphy-B
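
The fumble sentences printed above share a few recurring pieces: an optional forcing defender in parentheses, an optional `recovered by <TEAM>-<Player> at <spot>` clause, and the word 'Aborted' on bad snaps. The sketch below extracts those pieces from a single sentence. The function name, the regexes, and the dictionary keys are all assumptions made for illustration. `recovered_by_defense` relies on the notebook's observation that an uppercase 'RECOVERED' marks a recovery by the opposing team.

```python
import re

# Defender credited with forcing the fumble, e.g. 'FUMBLES (E.Speed)'.
FORCED_BY = re.compile(r"FUMBLES \((?P<name>[^)]+)\)")
# Recovery clause, e.g. 'RECOVERED by IND-E.Speed at IND 49'.
RECOVERY = re.compile(
    r"(?P<verb>RECOVERED|recovered) by (?P<team>[A-Z]{2,3})-(?P<player>\S+)"
    r" at (?P<spot>[A-Z]{2,3} \d+)"
)


def parse_fumble_sentence(sentence):
    """Extract forced-by and recovery details from one fumble sentence."""
    info = {
        "aborted": "Aborted" in sentence,
        "forced_by": None,
        "recovered_by": None,
        "recovery_spot": None,
        "recovered_by_defense": None,
    }
    forced = FORCED_BY.search(sentence)
    # 'FUMBLES (Aborted)' marks a bad snap, not a defender's name.
    if forced and forced.group("name") != "Aborted":
        info["forced_by"] = forced.group("name")
    recovery = RECOVERY.search(sentence)
    if recovery:
        info["recovered_by"] = (recovery.group("team"), recovery.group("player"))
        info["recovery_spot"] = recovery.group("spot")
        info["recovered_by_defense"] = recovery.group("verb") == "RECOVERED"
    return info


print(parse_fumble_sentence("FUMBLES (E.Speed), RECOVERED by IND-E.Speed at IND 49"))
print(parse_fumble_sentence("F.Ragnow FUMBLES at KC 24, recovered by DET-J.Goff at KC 27"))
print(parse_fumble_sentence("S.Clifford FUMBLES (Aborted) at CHI 35, and recovers at CHI 35"))
```

Self-recoveries ('and recovers at ...') and balls out of bounds have no `recovered by` clause, so they leave the recovery fields as `None`.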