In [None]:
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive/')

Mounted at /content/drive/


## **Architecture**

1.) **Textual Data Pipeline (Broadcast Transcripts)**
- Tokenize, de-noise, and clean the text
- Use a pre-trained transformer-based NLP model (BERT or GPT)
- Fine-tune the transformer on sports commentary for feature extraction
- Features Extracted:
    - Player/team mentions
    - Sentiment/emotion tied to game momentum
    - Descriptive language for current play situations

2.) **Numerical Data Pipeline (Game Stats)**
- Normalize numerical features
- Use a pre-trained feedforward neural network (FNN) to process the numerical data
- Features Extracted:
  - Play Info: Down, distance, field position
  - Game Info: Current score
  - Player Stats: QB completion rate, rushing yards
  - Play event sequences (run, pass, penalty)
- Encode categorical data (one-hot encoding?)
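
As a rough illustration of the normalization and categorical encoding steps listed above, here is a minimal standard-library sketch. The helper names `min_max_scale` and `one_hot` and the sample values are assumptions made for this example; min-max scaling is only one possible normalization, and the play types are taken from the event list above.

```python
def min_max_scale(values):
    """Scale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(value, categories):
    """Return a one-hot vector for `value` over a fixed category order."""
    return [1.0 if value == category else 0.0 for category in categories]


# Hypothetical sample data for illustration only
PLAY_TYPES = ["run", "pass", "penalty"]
ydstogo = [10, 4, 7, 1]

scaled = min_max_scale(ydstogo)
encoded = one_hot("pass", PLAY_TYPES)
print(scaled)
print(encoded)
```

In practice the scaling bounds would be computed on the training set only and reused for validation and test data.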

3.) **Temporal Modeling (Sequential Dependencies)**
- Use a Recurrent Neural Network (RNN), either an LSTM or a GRU, to process sequences of game states and textual embeddings from the transcript

  OR

- Use a Temporal Transformer (e.g. Time-BERT), which can model long-range dependencies in both the text and the numerical data.

4.) **Fusion Layer**
- Combine the features extracted by the NLP model, the FNN, and the temporal model by concatenating their feature vectors into a single unified vector
- Use a dense layer to project the combined features into a common space and reduce dimensionality.

5.) **Output Layer**
- For multi-event prediction, use a softmax activation function for multi-class classification
- Output event probabilities for a fixed time horizon
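
The fusion and output layers described above can be sketched without any deep learning framework. This is only a toy illustration under stated assumptions: the vector sizes, the random weights, the ReLU activation, and the helper names (`dense`, `softmax`, `fuse_and_predict`) are all hypothetical, and a real model would learn its weights in PyTorch or a similar library.

```python
import math
import random


def dense(vector, weights, bias):
    """Fully connected layer: `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def fuse_and_predict(text_vec, stats_vec, temporal_vec, proj, proj_bias, out, out_bias):
    fused = text_vec + stats_vec + temporal_vec                   # concatenation
    hidden = [max(0.0, h) for h in dense(fused, proj, proj_bias)]  # projection + ReLU
    return softmax(dense(hidden, out, out_bias))                   # event probabilities


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical feature vectors: 3 text dims, 2 stat dims, 2 temporal dims
rng = random.Random(0)
text_vec, stats_vec, temporal_vec = [0.2, -0.1, 0.4], [1.0, 0.5], [0.3, 0.3]
proj = random_matrix(4, 7, rng)   # 7 fused dims -> 4 hidden dims
out = random_matrix(3, 4, rng)    # 4 hidden dims -> 3 event classes (run, pass, penalty)

probs = fuse_and_predict(text_vec, stats_vec, temporal_vec,
                         proj, [0.0] * 4, out, [0.0] * 3)
print(probs)
```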




#### **Script for Data Preparation**

In [None]:
"""import json
import os


# Define paths
# Update these paths based on your folder structure in Google Drive
input_json_path = '/content/drive/My Drive/Portfolio/NFL/EventPredictionNLPModel/Data/raw_broadcast_JSON/raw_transcripts.json'  # Replace with the path to your JSON file
output_folder_path = '/content/drive/My Drive/Portfolio/NFL/EventPredictionNLPModel/Data/raw_broadcast_TXT/'  # Replace with the path to your output folder

# Ensure the output folder exists
os.makedirs(output_folder_path, exist_ok=True)

# Load the JSON file
with open(input_json_path, 'r') as json_file:
    data = json.load(json_file)

# Extract the transcript and save as .txt files
for file_name, content in data.items():
    transcript_text = content.get("transcript", "")
    if transcript_text:  # Ensure the transcript exists
        # Define the output file path
        output_file_path = os.path.join(output_folder_path, f"{file_name.replace('.txt', '')}.txt")

        # Write the transcript to the file
        with open(output_file_path, 'w') as txt_file:
            txt_file.write(transcript_text)
        print(f"Saved transcript to: {output_file_path}")
    else:
        print(f"No transcript found for: {file_name}")

print("Extraction complete")"""

#### **Prep Dataset**
Remove all broadcast transcripts from before 2001 to avoid inconsistencies.

In [None]:
"""
import os
import re

# Define the directory path
directory_path = "/content/drive/My Drive/Portfolio/NFL/EventPredictionNLPModel/Data/raw_broadcast_TXT/"

# Ensure the directory exists
if not os.path.exists(directory_path):
    print(f"Directory not found: {directory_path}")
else:
    # Iterate through the files in the directory
    for filename in os.listdir(directory_path):
        # Check if the filename starts with a number less than 2001
        match = re.match(r'^(\d+)', filename)
        if match:
            number = int(match.group(1))
            if number < 2001:
                file_path = os.path.join(directory_path, filename)
                try:
                    # Delete the file
                    os.remove(file_path)
                    print(f"Deleted: {file_path}")
                except Exception as e:
                    print(f"Error deleting {file_path}: {e}")
    print("File cleanup completed.")"""

There are now **844** broadcast transcripts in the dataset.

#### **Prep Dataset**
BERT cannot distinguish between plays, so the transcripts need to be broken into chunks. Each chunk holds the broadcast commentary from the moment a play begins until just before the next play begins. I used regular expressions, along with a few other rules, to split the transcripts play by play.

In [None]:
"""
import os
import re

# Specify the folder containing the text files
# Update this with your actual Google Drive folder path
input_folder_path = "/content/drive/My Drive/Portfolio/NFL/EventPredictionNLPModel/Data/raw_broadcast_TXT/"
output_folder_path = "/content/drive/My Drive/Portfolio/NFL/EventPredictionNLPModel/Data/new_processed_broadcast_TXT/"

# Create the output folder if it doesn't exist
os.makedirs(output_folder_path, exist_ok=True)

# Regular expression to identify play-related phrases
regex_pattern = r"(?i)\b(?:first down|second down|third down|fourth down|punt|first and|second and|third and|fourth and|field goal)\b"

def process_text(text):
    # Find all matches of the regex
    matches = list(re.finditer(regex_pattern, text))

    # If no matches, return the entire text as one segment
    if not matches:
        return [text.strip()]

    segments = []
    last_end = 0

    for i, match in enumerate(matches):
        start, end = match.start(), match.end()

        # Append the text from the last match to the current match
        if i > 0:
            segments.append(text[last_end:start].strip())
        last_end = start

    # Append the remaining text after the last match
    segments.append(text[last_end:].strip())
    return segments

# Process each text file in the folder
for file_name in os.listdir(input_folder_path):
    if file_name.endswith('.txt'):
        input_file_path = os.path.join(input_folder_path, file_name)
        output_file_path = os.path.join(output_folder_path, f"processed_{file_name}")

        with open(input_file_path, 'r') as file:
            text = file.read()

        # Process the text and split it into segments
        segments = process_text(text)

        # Write the segments to a new output file
        with open(output_file_path, 'w') as output_file:
            for segment in segments:
                output_file.write(segment + "\n\n")

print("Processing complete. Processed files are saved in the output folder.")"""


#### **80-10-10 Split:**
Training set (80%): *675 files*

Validation set (10%): *85 files*

Test set (10%): *84 files*
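
One way to reproduce these counts is sketched below. The function name `split_files`, the fixed seed, and the placeholder file names are assumptions for illustration; the actual split may have been produced differently. The training set takes 80% of the files, and the remainder is halved, with the validation set receiving the extra file.

```python
import random


def split_files(file_names, train_frac=0.8, seed=42):
    """Shuffle file names reproducibly and split them into train/val/test."""
    names = sorted(file_names)             # stable starting order
    random.Random(seed).shuffle(names)     # reproducible shuffle
    n_train = int(len(names) * train_frac)
    n_val = (len(names) - n_train + 1) // 2  # validation gets the odd file
    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:]
    return train, val, test


# Placeholder names standing in for the 844 transcript files
example_names = [f"game_{i:03d}.txt" for i in range(844)]
train, val, test = split_files(example_names)
print(len(train), len(val), len(test))  # 675 85 84
```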

### **Feature Mapping**

1.) Rule-based Assignment

2.) ML-based Predictions: For fields requiring contextual understanding like play_type and yards_gained

3.) Post-Processing and Computation: for calculated fields like quarter_seconds_remaining

#### **Feature Mapping Pipeline**
The mapping pipeline consists of:

* **Rule-based Assignment:** For static metadata and fields with straightforward mappings.
* **ML-based Predictions:** For fields requiring contextual understanding, like play_type and yards_gained.
* **Post-Processing and Computation:** For calculated fields (e.g., quarter_seconds_remaining).

The fields themselves fall into three categories:

* **Static Metadata:** Fields like game_id, home_team, away_team, game_date.
* **Dynamic Data:** Fields derived from textual features, e.g., desc, play_type, yards_gained.
* **Calculated Data:** Fields computed based on rules or models, e.g., yardline_100, quarter_seconds_remaining.
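
As an example of the post-processing stage, the clock-derived fields can be computed from the quarter and the game clock. This is a sketch under assumptions: the function name `clock_fields` is hypothetical, the clock is assumed to arrive as an `"MM:SS"` string, quarters are assumed to be 15 minutes, and overtime is ignored.

```python
QUARTER_SECONDS = 15 * 60  # regulation NFL quarter length


def clock_fields(qtr, clock):
    """Derive the time-remaining fields from a quarter (1-4) and an "MM:SS" clock."""
    minutes, seconds = (int(part) for part in clock.split(":"))
    quarter_left = minutes * 60 + seconds
    # Quarters 1 and 3 still have one full quarter left in their half
    quarters_left_in_half = 1 if qtr in (1, 3) else 0
    return {
        "quarter_seconds_remaining": quarter_left,
        "half_seconds_remaining": quarter_left + quarters_left_in_half * QUARTER_SECONDS,
        "game_seconds_remaining": quarter_left + (4 - qtr) * QUARTER_SECONDS,
        "game_half": "first" if qtr <= 2 else "second",
    }


print(clock_fields(1, "15:00"))
print(clock_fields(4, "02:00"))
```

The values are returned as integers here; they would need to be converted to strings if the output has to match a JSON layout that stores every field as text.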

## **Desired JSON**
Example of the structure and content of the extracted features in JSON format.

In [None]:
      # Static
      {
        "home_team": "Chiefs",
        "away_team": "Eagles",
        "game_date": "2022-10-16"
      }

      # Play Stats
      {
        "play_id": "1",
        "posteam": "Chiefs",
        "posteam_type": "offense",
        "defteam": "Eagles",
        "side_of_field": "KC",
        "quarter_seconds_remaining": "900",
        "half_seconds_remaining": "1800",
        "game_seconds_remaining": "3600",
        "game_half": "first",
        "qtr": "1",
        "down": "1",
        "yrdln": "KC 20",
        "ydstogo": "10",
        "desc": "Mahomes pass complete to Kelce for 20 yards",
        "play_type": "pass",
        "yards_gained": "20"
      }

Changes made relative to the numerical game data:

In [None]:
# Deleted keys
        "game_id":
        "yardline_100": "80",
        "quarter_end": "0",
        "drive": "1",
        "sp": "0",
        "goal_to_go": "1",
        "time": "15:00",
        "ydsnet": "80",
# Keys to be potentially added
        "home_points": "10",
        "visitor_points": "5",
        "penalty": "no",
        "timeout": "yes",
home_points:
- "it's [number]"

visitor_points:
- home_points:
- "it's [number]"

## **Textual Data Layer**
**Input:** .txt files containing broadcast data

**Pre-Trained Transformer:** BERT

**Output:** .json containing extracted features

1.) Tokenization

2.) De-noising

3.) BERT for feature extraction

This is the structure of my textual data pipeline layer. Open question: how should I structure the cleaning, tokenizing, de-noising, and feature extraction for the broadcast transcripts, given that each game is stored in its own .txt file?

### **Individual Play Text Preparation**

"

*third down and seven runners got him today wide open he strikes with Demetrius Harris again and what a way to set up the drama a 35yard completion by kid wonder showtime mahomes as we hit the twominute warning and here come the cheats that kind of night on Monday Night Football as people if they hear all the Blues here at Mile High because the whole crowd felt that the plate clock had hit zero on that 35yard way you see it right there Joe right before that ball snapped it goes to zero well our longtime rules expert Jeff Triplett can't be challenged right yup that's correct this can't be challenged but what you got to understand is the back judge is looking at that play he looks at the flop play clock go to zero his mechanic is and then look at the snap so if he felt like the snap was going off then he then to play goes so here we are a*"

Above is a single chunk of text from an NFL broadcast. It represents the announcers' commentary about a single play, up until the next play begins. To test my extraction method, I will use a limited number of stats: the ones most commonly found in a chunk of commentary.

We will use regular expressions in combination with BERT to extract the following data:

* Down number
* Yardline
* Yards to go
* Play type
* Yards gained

I have devised the following regular expressions to help BERT identify these pieces of data:

* **Down number:** r"(?m)^\s*$\n\s*(\w+)"

* **Yardline**: r"(?:\d+|[a-zA-Z]+)(?=\s*yard\s*line|yardline)"

* **Yards to go:** r"\n\n.*\band\s(\b\w+)"

* **Play type:** r"\b(?:pass|rush|punt|kick|handoff|throw|rushes|passes|throws|run|runs|kicks|punts|completion|completes|bootleg|scramble|running|shot down|strike|return|play action|playaction|caught|caught by)\b"

* **Yards gained:** r"\b(?:\d+|[a-zA-Z]+)\s*(?:yards?|on the play|penalty)\b|\bgain of\s*(?:\d+|[a-zA-Z]+)\b|\bloss of\s*(?:\d+|[a-zA-Z]+)\b|\bpicks\s*(?:\d+|[a-zA-Z]+)\b|\bgets\s*(?:\d+|[a-zA-Z]+)\b"

### Test the regular expressions

In [None]:
import re

# Sample text
text = """

second and four looks one way comes back the other now extending and gets back to Tyreke he'll the Broncos left right the Broncos are doing a great job Joe you see how my homes is Holt you see I he's petting the football he's holding it because the purpose is there they're confusing the young guy nice job by the different potholes defense he gets to be the nofly zone they've been giving up pass yards I used to see them playing well tonight you know when God they miss on the outside Sammy Watkins he has the speed and the size and the route running ability when the homes extended those place he can get separation and said well seeing Chris hotly in there Watkins out with a hamstring.
"""

# Regular expression pattern
pattern = r"(?m)^\s*$\n\s*(\w+)"


# Find all matches
matches = re.findall(pattern, text)

# Output the results
print("Matched words after blank lines:")
for match in matches:
    print(match)

In [None]:
import re

text = """

second and four looks one way comes back the o
"""

# Regular expression
pattern = r"\n\n.*\band\s(\b\w+)"

# Match
match = re.search(pattern, text)
if match:
    print(match.group(1))  # Output: four
else:
    print("No match found")


In [None]:
import re

# Regular expression
pattern = r"(?:\d+|[a-zA-Z]+)(?=\s*yard\s*line|yardline)"

# Test cases
test_strings = [
    "first down constantly extending plays giving his team a chance and then using that flamethrower of a right arm they were trying to get where out quick out the back cut off that bunch step to pick route Denver did a good job passing off but you know regardless of what happens tonight you know and this is not about chipping of holes best game statistically but he's showing that grit and determination on the road I like the moxie but I more importantly I like the poise and the noise going underneath Kelsey's spin spree and he's to the 9 yard line you know what that is bo"
]

# Apply regex and print matches
for text in test_strings:
    match = re.search(pattern, text)
    if match:
        print(f"Matched: '{match.group()}' in '{text}'")
    else:
        print(f"No match in '{text}'")


In [None]:
import re

# Regular expression
pattern = r"\b(?:pass|rush|punt|kick|handoff|throw|rushes|passes|throws|run|runs|kicks|punts|completion|completes|bootleg|scramble|running|shot down|strike|return|play action|playaction|caught|caught by)\b"

# Test cases
test_strings = [
    "third down and ten under five and a half to play and trying to come in with a win and hold on to a threepoint margin keenum he's gonna run for push out of bounds that time by Allen Bailey excellent play by Bailey to track keenum you know statistically this Kansas City defense has been in the bottom maybe even last in a lot of care categories but they've been number one on"
]

# Apply regex and print matches
for text in test_strings:
    match = re.search(pattern, text)
    if match:
        print(f"Matched: '{match.group()}' in '{text}'")
    else:
        print(f"No match in '{text}'")

In [None]:
import re

# Regular expression
pattern = r"\b(?:\d+|[a-zA-Z]+)\s*(?:yards?|on the play|penalty)\b|\bgain of\s*(?:\d+|[a-zA-Z]+)\b|\bloss of\s*(?:\d+|[a-zA-Z]+)\b|\bpicks\s*(?:\d+|[a-zA-Z]+)\b|\bgets\s*(?:\d+|[a-zA-Z]+)\b"

# Test cases
test_strings = [
    "push ahead as he makes his way to the 45 a gain of five for Hunter you know the one thing on Patrick mahomes it's rare to"
]

# Apply regex and print matches
for text in test_strings:
    match = re.search(pattern, text)
    if match:
        print(f"Matched: '{match.group()}' in '{text}'")
    else:
        print(f"No match in '{text}'")

## **Use RegEx and BERT to extract specific stats from each play of commentary**

In [None]:
import os
import re
import json
from transformers import BertTokenizer, BertModel
import torch
import numpy as np

In [None]:
pip install transformers torch



### Test this method

In [None]:
import re
from transformers import BertTokenizer, BertForMaskedLM
import torch

# Regular expressions for various pieces of data
down_number_regex = r"(?m)^\s*$\n\s*(\w+)"
yardline_regex = r"(?:\d+|[a-zA-Z]+)(?=\s*yard\s*line|yardline)"
yards_to_go_regex = r"\n\n.*\band\s(\b\w+)"
play_type_regex = r"\b(?:pass|rush|punt|kick|handoff|throw|rushes|passes|throws|run|runs|kicks|punts|completion|completes|bootleg|scramble|running|shot down|strike|return|play action|playaction|caught|caught by)\b"
yards_gained_regex = r"\b(?:\d+|[a-zA-Z]+)\s*(?:yards?|on the play|penalty)\b|\bgain of\s*(?:\d+|[a-zA-Z]+)\b|\bloss of\s*(?:\d+|[a-zA-Z]+)\b|\bpicks\s*(?:\d+|[a-zA-Z]+)\b|\bgets\s*(?:\d+|[a-zA-Z]+)\b"

# Sample input text from the NFL broadcast
text = """

first down constantly extending plays giving his team a chance and then using that flamethrower of a right arm they were trying to get where out quick out the back cut off that bunch step to pick route Denver did a good job passing off but you know regardless of what happens tonight you know and this is not about chipping of holes best game statistically but he's showing that grit and determination on the road I like the moxie but I more importantly I like the poise and the noise going underneath Kelsey's spin spree and he's to the 9 yard line you know what that is bo
"""

# Initialize BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Function to extract data using regex
def extract_data_using_regex(text):
    # Extract down number
    down_match = re.search(down_number_regex, text)
    down_number = down_match.group(1) if down_match else None

    # Extract yardline
    yardline_match = re.search(yardline_regex, text)
    yardline = yardline_match.group(0) if yardline_match else None

    # Extract yards to go
    yards_to_go_match = re.search(yards_to_go_regex, text)
    yards_to_go = yards_to_go_match.group(1) if yards_to_go_match else None

    # Extract play type
    play_type_match = re.search(play_type_regex, text)
    play_type = play_type_match.group(0) if play_type_match else None

    # Extract yards gained
    yards_gained_match = re.search(yards_gained_regex, text)
    yards_gained = yards_gained_match.group(0) if yards_gained_match else None

    return down_number, yardline, yards_to_go, play_type, yards_gained

# Function to fill missing data using BERT
def fill_using_bert(text, mask="<MASK>"):
    # NOTE: the transcript contains no "<MASK>" placeholder, so replace() is a
    # no-op and the `mask` argument (e.g. "play type") never reaches the model.
    input_text = text.replace("<MASK>", mask)
    inputs = tokenizer(input_text, return_tensors='pt')

    # Run BERT without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = outputs.logits

    # Read the prediction at the last position, which holds the final [SEP]
    # token rather than a [MASK] token; this likely explains the "." results.
    predicted_token_id = torch.argmax(predictions[0, inputs['input_ids'].shape[1] - 1]).item()
    predicted_token = tokenizer.decode(predicted_token_id)

    return predicted_token

# Apply regex to extract known values
down_number, yardline, yards_to_go, play_type, yards_gained = extract_data_using_regex(text)

# If any values are missing, use BERT to predict or infer them
if not down_number:
    down_number = fill_using_bert(text, mask="down number")
if not yardline:
    yardline = fill_using_bert(text, mask="yardline")
if not yards_to_go:
    yards_to_go = fill_using_bert(text, mask="yards to go")
if not play_type:
    play_type = fill_using_bert(text, mask="play type")
if not yards_gained:
    yards_gained = fill_using_bert(text, mask="yards gained")

# Output the extracted data
print(f"Down number: {down_number}")
print(f"Yardline: {yardline}")
print(f"Yards to go: {yards_to_go}")
print(f"Play type: {play_type}")
print(f"Yards gained: {yards_gained}")

Down number: first
Yardline: 9
Yards to go: he
Play type: .
Yards gained: 9 yard
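
The "." for play type comes from the BERT fallback, which reads the prediction at the final token instead of at a mask token. A masked language model only fills in positions marked with its own mask token (`[MASK]` for `bert-base-uncased`). The sketch below shows one way such a query could be set up. The prompt wording, the `max_words` limit, and both function names are assumptions for illustration, and an off-the-shelf masked LM is not trained to answer questions like this, so its guesses may still be unreliable.

```python
def build_mask_prompt(chunk, field, mask_token="[MASK]", max_words=150):
    """Append a cloze sentence to the chunk so the model has a slot to fill.

    The chunk is cut to `max_words` words so the appended mask is unlikely
    to be lost when the tokenizer truncates to 512 tokens.
    """
    words = chunk.split()[:max_words]
    return f"{' '.join(words)} . the {field} is {mask_token} ."


def predict_masked_token(prompt, tokenizer, model):
    """Return the model's top prediction at the first mask position."""
    import torch  # imported lazily so the prompt helper works without PyTorch

    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    is_mask = inputs["input_ids"][0] == tokenizer.mask_token_id
    mask_positions = is_mask.nonzero(as_tuple=True)[0]
    if len(mask_positions) == 0:
        return None  # the mask was truncated away
    with torch.no_grad():
        logits = model(**inputs).logits
    token_id = int(logits[0, mask_positions[0]].argmax())
    return tokenizer.decode([token_id]).strip()


prompt = build_mask_prompt("he's to the 9 yard line you know what that is", "play type")
print(prompt)

# Usage with the tokenizer and model loaded earlier in the notebook:
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# model = BertForMaskedLM.from_pretrained('bert-base-uncased')
# print(predict_masked_token(prompt, tokenizer, model))
```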


In [None]:
import os
import re
import json
from transformers import BertTokenizer, BertForMaskedLM
import torch
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# Define paths
input_folder = "/content/drive/My Drive/Portfolio/NFL/EventPredictionNLPModel/Data/new_processed_broadcast_TXT/"
output_folder = "/content/drive/My Drive/Portfolio/NFL/EventPredictionNLPModel/Data/play_processed_json/"

# Create output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Initialize BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased')

# Regular expressions for various pieces of data
down_number_regex = r"(?m)^\s*$\n\s*(\w+)"
yardline_regex = r"(?:\d+|[a-zA-Z]+)(?=\s*yard\s*line|yardline)"
yards_to_go_regex = r"\n\n.*\band\s(\b\w+)"
play_type_regex = r"\b(?:pass|rush|punt|kick|handoff|throw|rushes|passes|throws|run|runs|kicks|punts|completion|completes|bootleg|scramble|running|shot down|strike|return|play action|playaction|caught|caught by)\b"
yards_gained_regex = r"\b(?:\d+|[a-zA-Z]+)\s*(?:yards?|on the play|penalty)\b|\bgain of\s*(?:\d+|[a-zA-Z]+)\b|\bloss of\s*(?:\d+|[a-zA-Z]+)\b|\bpicks\s*(?:\d+|[a-zA-Z]+)\b|\bgets\s*(?:\d+|[a-zA-Z]+)\b"

# Function to extract data using regex
def extract_data_using_regex(text):
    down_match = re.search(down_number_regex, text)
    down_number = down_match.group(1) if down_match else None

    yardline_match = re.search(yardline_regex, text)
    yardline = yardline_match.group(0) if yardline_match else None

    yards_to_go_match = re.search(yards_to_go_regex, text)
    yards_to_go = yards_to_go_match.group(1) if yards_to_go_match else None

    play_type_match = re.search(play_type_regex, text)
    play_type = play_type_match.group(0) if play_type_match else None

    yards_gained_match = re.search(yards_gained_regex, text)
    yards_gained = yards_gained_match.group(0) if yards_gained_match else None

    return down_number, yardline, yards_to_go, play_type, yards_gained

# Function to fill missing data using BERT
def fill_using_bert(text, mask="<MASK>"):
    # The text contains no "<MASK>" placeholder, so replace() leaves it unchanged;
    # the tokenizer truncates the input to BERT's 512-token limit
    input_text = text.replace("<MASK>", mask)
    inputs = tokenizer(input_text, return_tensors='pt', truncation=True, max_length=512)

    # Make prediction with BERT
    with torch.no_grad():
        outputs = model(**inputs)
        predictions = outputs.logits

    # Get predicted token
    predicted_token_id = torch.argmax(predictions[0, inputs['input_ids'].shape[1] - 1]).item()
    predicted_token = tokenizer.decode(predicted_token_id)

    return predicted_token

# Process text files in the input folder
for filename in os.listdir(input_folder):
    if filename.endswith(".txt"):
        filepath = os.path.join(input_folder, filename)

        # Read the file content
        with open(filepath, "r") as file:
            content = file.read()

        # Split text into chunks by blank lines
        chunks = [chunk.strip() for chunk in content.split("\n\n") if chunk.strip()]

        # Initialize a list to store all play data for this file
        all_plays = []

        # Process each chunk
        for i, chunk in enumerate(chunks):
            down_number, yardline, yards_to_go, play_type, yards_gained = extract_data_using_regex(chunk)

            # Fill missing values using BERT
            if not down_number:
                down_number = fill_using_bert(chunk, mask="down number")
            if not yardline:
                yardline = fill_using_bert(chunk, mask="yardline")
            if not yards_to_go:
                yards_to_go = fill_using_bert(chunk, mask="yards to go")
            if not play_type:
                play_type = fill_using_bert(chunk, mask="play type")
            if not yards_gained:
                yards_gained = fill_using_bert(chunk, mask="yards gained")

            # Create play data object
            play_data = {
                "transcript": chunk,
                "down": down_number or "unknown",
                "yrdln": yardline or "unknown",
                "play_type": play_type or "unknown",
                "yards_gained": yards_gained or "unknown",
            }

            # Append play data to the list
            all_plays.append(play_data)

        # Save all plays to a single JSON file
        output_filename = f"{os.path.splitext(filename)[0]}.json"
        output_filepath = os.path.join(output_folder, output_filename)

        with open(output_filepath, "w") as json_file:
            json.dump(all_plays, json_file, indent=4)

print("Processing complete. JSON files saved to:", output_folder)


KeyboardInterrupt: 

Processing a single game takes about 3.5 minutes.

At that rate, all 844 transcripts would take about 50 hours (844 × 3.5 minutes ≈ 49 hours).

**Some Problems:**

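It is not picking up the downs. The down regex expects a blank line before the first word, but each chunk is stripped after splitting on blank lines, so the pattern never matches and the BERT fallback supplies the "." values shown below. I either need to stop basing it on the blank line OR refine how the plays are split up.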
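
Because the splitting step cuts each chunk at the start of a down phrase, one option is to read the down (and the distance, when it is spoken) from the first words of the chunk instead of relying on a blank line. The sketch below is an assumption about how that could look; the pattern names, `parse_down_and_distance`, and the number-word table are hypothetical.

```python
import re

DOWN_WORDS = {"first": 1, "second": 2, "third": 3, "fourth": 4}
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7,
    "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
    "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
    "eighteen": 18, "nineteen": 19, "twenty": 20,
}

# "second and four", "third down and seven", ...
DOWN_DISTANCE_RE = re.compile(
    r"^\s*(first|second|third|fourth)\s+(?:down\s+)?and\s+(\d+|[a-z]+)\b",
    re.IGNORECASE,
)
# "first down ..." with no distance spoken
DOWN_ONLY_RE = re.compile(r"^\s*(first|second|third|fourth)\s+down\b", re.IGNORECASE)


def parse_down_and_distance(chunk):
    """Return (down, yards_to_go); either value is None when it is not stated."""
    match = DOWN_DISTANCE_RE.match(chunk)
    if match:
        raw = match.group(2).lower()
        distance = int(raw) if raw.isdigit() else NUMBER_WORDS.get(raw)
        return DOWN_WORDS[match.group(1).lower()], distance
    match = DOWN_ONLY_RE.match(chunk)
    if match:
        return DOWN_WORDS[match.group(1).lower()], None
    return None, None


print(parse_down_and_distance("second and four looks one way comes back the other"))
print(parse_down_and_distance("third down and seven runners got him today"))
print(parse_down_and_distance("first down constantly extending plays"))
```

Anchoring at the start of the chunk also avoids the greedy `.*` in the yards-to-go pattern, which otherwise latches onto the last "and" in the commentary.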

In [None]:
    {
        "transcript": "first down at the ranked by both teams combined down Jamal Lewis and Barrow was there boy what a rookie rusher Jamal Lewis has been came out of Tennessee firstround draft pick of the Baltimore Ravens over 1,300 yards rushing and six touchdowns 16 hundred and 60 total yards this year for Baltimore what a tough thing for a rookie come in Greg to have all those carries and with his running style makes it even more impressive because he doesn't like run away from hits he looks for that punishment likes to deliver the blow the defender and that's why he's such a good runner now rules it out of bounds with a little violence courtesy of the New York giant now the one thing you got to do Trent Dilfer trying to find a wide receiver open down the field but try not to take these extra hits it begin Jessie Armstead and Michael Barrow get to Trent Dilfer that time but what I was going to say earlier Greg Trent Dilfer",
        "down": ".",
        "yrdln": ".",
        "play_type": "running",
        "yards_gained": "300 yards"
    }

Problem: Commentators talking about season-long stats (here, "over 1,300 yards rushing" was picked up as "300 yards").

Solution: Cap yards_gained at 100.
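
A minimal sketch of that cap is shown below. It takes the raw string returned by the yards gained regex, converts the first numeric token (digits or a spelled-out number) to an integer, and rejects anything above the limit. The function name `parse_yards` and the number-word table are assumptions for illustration.

```python
import re

NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6, "seven": 7,
    "eight": 8, "nine": 9, "ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
    "fourteen": 14, "fifteen": 15, "sixteen": 16, "seventeen": 17,
    "eighteen": 18, "nineteen": 19, "twenty": 20,
}


def parse_yards(match_text, max_yards=100):
    """Convert a regex match such as "gain of five" to an int, or None if invalid."""
    for token in re.findall(r"\d+|[a-z]+", match_text.lower()):
        if token.isdigit():
            value = int(token)
        elif token in NUMBER_WORDS:
            value = NUMBER_WORDS[token]
        else:
            continue  # skip non-numeric words such as "gain" or "yards"
        # Reject season totals and other values that cannot be a single play
        return value if value <= max_yards else None
    return None


print(parse_yards("300 yards"))     # None
print(parse_yards("gain of five"))  # 5
print(parse_yards("picks out"))     # None
```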

In [None]:
    {
        "transcript": "third down throwing is by far the hardest thing for quarterback let's see what it looks like to Trent Dilfer defensive lineman in front of him we go around steps up nice throw in Lane finds the open spot and picks out Patrick Johnson on the outside the third",
        "down": ".",
        "yrdln": ".",
        "play_type": "throw",
        "yards_gained": "picks out"
    }

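Problem: Why did this get picked up? The `\bpicks\s*(?:\d+|[a-zA-Z]+)\b` branch of the yards gained regex accepts any word after "picks", so "picks out" counts as a match.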

In [None]:
    {
        "transcript": "first down barbar up together and to the 42yard line before he is brought down Cory Harris Ray Lewis with the tackle but the Giants will have to kick it away well the Giants changed field position they've moved it down they got a chance to pin the Baltimore Ravens to see deep defense you got to go out there do your job and get the football back to the to your offense in good field position Brad Maynard will kick it away Jermaine Lewis stands at his own 10 yard line Bearcats ball for and made at the 11 or 12",
        "down": ".",
        "yrdln": "42",
        "play_type": "kick",
        "yards_gained": "42yard"
    },

Problem: Confusing yardline and yards gained

Solution: Refine regex, create a few more rules

In [None]:
    {
        "transcript": "first down its defensive back Ramos MacDonald from the seven and a half yard line the Giants on",
        "down": ".",
        "yrdln": "half",
        "play_type": ".",
        "yards_gained": "half yard"
    }

Problem: Yardline and yards gained are conflated, and the spelled-out "seven and a half yard line" is not parsed correctly.

Solution: Refine the yardline and yards_gained regexes.
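
One possible refinement is sketched here: the yardline pattern captures the number in front of "yard line" (skipping an optional "and a half"), while the yards gained pattern requires a number and uses a negative lookahead to refuse anything followed by "line". The pattern names and the short number-word list are assumptions, and only the "N yards" form is covered.

```python
import re

NUMBER_WORDS = ["one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"]
# Longest words first so the alternation never stops on a shorter prefix
NUM = r"(?:\d+|" + "|".join(sorted(NUMBER_WORDS, key=len, reverse=True)) + ")"

# "42yard line", "9 yard line", "seven and a half yard line"
YARDLINE_RE = re.compile(
    rf"\b({NUM})(?:\s+and\s+a\s+half)?\s*yard\s*line\b", re.IGNORECASE
)
# "35yard completion", "20 yards", but never "... yard line"
YARDS_GAINED_RE = re.compile(rf"\b({NUM})\s*yards?\b(?!\s*line)", re.IGNORECASE)


def extract_yardline(chunk):
    match = YARDLINE_RE.search(chunk)
    return match.group(1) if match else None


def extract_yards_gained(chunk):
    match = YARDS_GAINED_RE.search(chunk)
    return match.group(1) if match else None


chunk = "from the seven and a half yard line the Giants on"
print(extract_yardline(chunk), extract_yards_gained(chunk))  # seven None
```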

In [None]:
    {
        "transcript": "third down is by far the hardest thing for quarterback the National Football League when you play for a team that runs it like Baltimore does he's talking to ghosts I really enjoy throwing the football on third downs and they'll just go well you really have settled into that job up there in Baltimore worldwide was here flips no further staff the Giants told us this week they liked the matchup of Michael Strahan on the right side against Harry Swain well we actually asked the Baltimore Ravens what scares you in the football game they said hey Harry Swain against Michael Strahan so we expected to see him doubleteam most the time that time he was singled up he gets a sack Tiki Barber to take the",
        "down": ".",
        "yrdln": ".",
        "play_type": "runs",
        "yards_gained": "gets a"
    }

Solution: Refine "gets" part of the yards gained regex

In [None]:
    {
        "transcript": "punt which rolls off the side of the foot that takes a Baltimore bounce and is down at the 40yard line barber and canola behind Collins off",
        "down": ".",
        "yrdln": "40",
        "play_type": "punt",
        "yards_gained": "40yard"
    }

Solution: Create a rule so that punts and field goals leave the down and yards_gained slots empty.
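
A sketch of such a rule as a post-processing step on each play record follows. The function name `apply_kicking_rule`, the set of kicking play types, and the choice of `None` as the "empty" value are assumptions; the rule also checks the start of the transcript because "field goal" is not one of the play type keywords.

```python
KICKING_PLAY_TYPES = {"punt", "punts", "field goal"}
KICKING_PREFIXES = ("punt", "field goal")


def apply_kicking_rule(play):
    """Return a copy of the play with down and yards_gained emptied for kicks."""
    cleaned = dict(play)
    transcript = cleaned.get("transcript", "").lstrip().lower()
    is_kick = (
        cleaned.get("play_type") in KICKING_PLAY_TYPES
        or transcript.startswith(KICKING_PREFIXES)
    )
    if is_kick:
        cleaned["down"] = None
        cleaned["yards_gained"] = None
    return cleaned


play = {
    "transcript": "punt which rolls off the side of the foot",
    "down": ".",
    "yrdln": "40",
    "play_type": "punt",
    "yards_gained": "40yard",
}
print(apply_kicking_rule(play))
```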

In [None]:
    {
        "transcript": "first down vol Louis we will talk from time to time tonight about a vision if you're wondering what it is this is what it looks like well we'll give you some shots of as the game goes along watch Gary Collins he got a pass Russ quarterback he drops back look he's he's a big Lane look at that big Lane so he steps up into it knows he can't stand there too long because defensive lineman from behind they're gonna come after he picks up a few yards Collins looks on as Dilfer looks at a second at 11 now this way and slipping down to the turf is Jamal Lewis incomplete it'll be third in 11 well the one thing Trent Dilfer he has a very strong arm he can throw it down the field you know he's one of the top 10 players in the league and you say throw it down 15 yards or farther but his weakness short passes look at that Jamal Lewis is going to run four big yards Trent Dilfer with not very good touch on the football throws it short dopher one out of four for just four yards and faced with third and long here he needs the Giants 41 yard line movement right in the center of the line and let's see if the Giants jumped or if the Ravens move first encroachment 75 he pins five yards still",
        "down": ".",
        "yrdln": "41",
        "play_type": "pass",
        "yards_gained": "picks up"
    },

Problems: Lots of problems with this one (line 45).

In [None]:
    {
        "transcript": "first down we expected defense to come to the floor this is the third threeandout well we can expect a lot of three announced today Greg and you saw marvin lewis the defensive coordinator again what happens is you just got to settle in and get used to how good the defense is make your adjustment and call plays that gives you a better chance to succeed Brad Maynard to kick it away to Jermaine Lewis bouncing autobahns just across midfield Lewis and gash in the backfield behind Trent Dilfer up",
        "down": ".",
        "yrdln": ".",
        "play_type": "kick",
        "yards_gained": "."
    },

Problem: "three and out" and its transcription variants are not recognized. Add "three and out", "threeandout", "three andout", "threeanout", and "threean out".
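
Rather than listing every spelling, a single tolerant pattern can cover these variants. This is a sketch; the names `THREE_AND_OUT_RE` and `mentions_three_and_out` are hypothetical, and how a match should feed into the play record is left open.

```python
import re

# Optional whitespace between the words, and an optional "d" in "and"
THREE_AND_OUT_RE = re.compile(r"\bthree\s*and?\s*out\b", re.IGNORECASE)


def mentions_three_and_out(chunk):
    """Return True if the chunk contains any spelling of "three and out"."""
    return bool(THREE_AND_OUT_RE.search(chunk))


variants = ["three and out", "threeandout", "three andout", "threeanout", "threean out"]
print([mentions_three_and_out(v) for v in variants])
print(mentions_three_and_out("we can expect a lot of three announced today"))
```

The last call shows the limit of this approach: a mis-transcription such as "three announced" is not caught by the pattern.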