# Before we do anything:

We must download the rest of the data and split it into training and testing batches.

### Plan of action:
- Download the play_by_play data for each year in CSV format
- Download the injury data for each year in CSV format
- Put all of it in the data directory
- Create a Python script (`format_data.py`) that applies all of the preprocessing steps (cleaning, constructing, formatting) from the Data Prep section
- Run the Python script on each year for the play_by_play and injury data
- Split the years into two sets, bounded by the range of years the data covers:
    - Training data: 2009 - 2017
    - Testing data: 2018 - 2024
- Make a training directory and a testing directory, one for each set
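
The split and directory steps in the plan can be sketched as follows. This is a minimal illustration, not code from this notebook: the helper names `split_for_year` and `make_split_dirs` are hypothetical, and the layout (one `training` and one `testing` folder under a base directory of your choice) is an assumption.

```python
import os

# Year ranges from the plan. range() excludes its stop value,
# so 2018 and 2025 are one past the last year of each set.
TRAINING_YEARS = range(2009, 2018)  # 2009 - 2017
TESTING_YEARS = range(2018, 2025)   # 2018 - 2024


def split_for_year(year):
    """Return "training" or "testing" for a season (hypothetical helper)."""
    if year in TRAINING_YEARS:
        return "training"
    if year in TESTING_YEARS:
        return "testing"
    raise ValueError(f"No split defined for year {year}")


def make_split_dirs(base_dir):
    """Create one directory per split under base_dir (assumed layout)."""
    paths = {}
    for split in ("training", "testing"):
        path = os.path.join(base_dir, split)
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        paths[split] = path
    return paths
```

Keeping the year ranges in one place means the boundary between the two sets only has to change in a single spot if more seasons are added later.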

## Imports

In [4]:
import os
import subprocess

## Split into training/testing batches

In [12]:
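# Process each season in the training dataset (2009 - 2017)
data_dir = os.path.join(os.getcwd(), "data")
for year in range(2009, 2018):  # stop value is exclusive, so the last year is 2017
    # Input CSVs for this season
    play_by_play_file = os.path.join(data_dir, f"play_by_play_{year}.csv")
    injuries_file = os.path.join(data_dir, f"injuries_{year}.csv")

    # format_data.py must be executable and located in the current working directory
    command = ["./format_data.py", play_by_play_file, injuries_file, "training", str(year)]
    try:
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(command, check=True)
        print(f"Processed files for year {year}")
    except subprocess.CalledProcessError as e:
        print(f"Error processing files for year {year}: {e}")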

total rows in injury_df_cleaned:  4549
Counts of Primary Injury Data:
+----+--------------------+-------+
|    |   Primary Injury   | Count |
+----+--------------------+-------+
| 0  |        Knee        | 1038  |
| 1  |       Ankle        |  683  |
| 2  |      Shoulder      |  362  |
| 3  |     Hamstring      |  334  |
| 4  |        Foot        |  252  |
| 5  |       Groin        |  183  |
| 6  |        Back        |  177  |
| 7  |      Illness       |  152  |
| 8  |        Calf        |  112  |
| 9  |     Concussion     |  97   |
| 10 |        Hip         |  92   |
| 11 |        Neck        |  87   |
| 12 |       Thumb        |  76   |
| 13 |     Quadricep      |  76   |
| 14 |        Toe         |  67   |
| 15 |       Thigh        |  66   |
| 16 |       Elbow        |  55   |
| 17 |        Head        |  55   |
| 18 |       Wrist        |  50   |
| 19 |        Hand        |  47   |
| 20 |       Chest        |  41   |
| 21 |        Rib         |  35   |
| 22 |        Ribs        |  3

In [14]:
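# Process each season in the testing dataset (2018 - 2024)
data_dir = os.path.join(os.getcwd(), "data")
for year in range(2018, 2025):  # stop value is exclusive, so the last year is 2024
    # Input CSVs for this season
    play_by_play_file = os.path.join(data_dir, f"play_by_play_{year}.csv")
    injuries_file = os.path.join(data_dir, f"injuries_{year}.csv")

    # format_data.py must be executable and located in the current working directory
    command = ["./format_data.py", play_by_play_file, injuries_file, "testing", str(year)]
    try:
        # check=True raises CalledProcessError on a non-zero exit status
        subprocess.run(command, check=True)
        print(f"Processed files for year {year}")
    except subprocess.CalledProcessError as e:
        print(f"Error processing files for year {year}: {e}")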

total rows in injury_df_cleaned:  2430
Counts of Primary Injury Data:
+----+--------------------+-------+
|    |   Primary Injury   | Count |
+----+--------------------+-------+
| 0  |        Knee        |  469  |
| 1  |       Ankle        |  345  |
| 2  |     Hamstring      |  314  |
| 3  |      Shoulder      |  168  |
| 4  |        Foot        |  162  |
| 5  |     Concussion     |  131  |
| 6  |       Groin        |  112  |
| 7  |        Calf        |  86   |
| 8  |        Back        |  82   |
| 9  |        Hip         |  64   |
| 10 |      Illness       |  53   |
| 11 |        Toe         |  46   |
| 12 |        Neck        |  43   |
| 13 |     Quadricep      |  38   |
| 14 |       Elbow        |  32   |
| 15 |        Heel        |  30   |
| 16 | Not Injury Related |  23   |
| 17 |       Chest        |  19   |
| 18 |       Thumb        |  19   |
| 19 |       Thigh        |  18   |
| 20 |        Rib         |  17   |
| 21 |        Shin        |  15   |
| 22 |        Hand        |  1
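
The two loops above differ only in the year range and the split label, so they could share one helper. The following is a sketch under assumptions: `build_command` and `process_years` are hypothetical names, and the missing-file check is an addition that is not present in the cells above. It assumes `format_data.py` takes the same four positional arguments used in this notebook.

```python
import os
import subprocess


def build_command(data_dir, split, year, script="./format_data.py"):
    """Build the argument list used in the cells above for one season."""
    play_by_play_file = os.path.join(data_dir, f"play_by_play_{year}.csv")
    injuries_file = os.path.join(data_dir, f"injuries_{year}.csv")
    return [script, play_by_play_file, injuries_file, split, str(year)]


def process_years(years, split, data_dir, script="./format_data.py"):
    """Run the script once per year; return the years that failed or were skipped."""
    failed = []
    for year in years:
        command = build_command(data_dir, split, year, script)
        # Assumption: skip a season whose input CSVs are missing
        missing = [p for p in command[1:3] if not os.path.isfile(p)]
        if missing:
            print(f"Skipping {year}, missing input files: {missing}")
            failed.append(year)
            continue
        try:
            subprocess.run(command, check=True)
            print(f"Processed files for year {year}")
        except (subprocess.CalledProcessError, OSError) as e:
            # OSError covers a script that is missing or not executable
            print(f"Error processing files for year {year}: {e}")
            failed.append(year)
    return failed
```

With this helper, the training run would be `process_years(range(2009, 2018), "training", os.path.join(os.getcwd(), "data"))` and the testing run would pass `range(2018, 2025)` and `"testing"`. Returning the list of failed years makes it easy to rerun only the seasons that need attention.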