## Modeling 
The goal of this notebook is to confirm which model will be used for the offensive coordinator and to ensure it can be implemented effectively with the data from 01_exploration before the preprocessing steps are ported to C++. Eventually, the modeling steps in this notebook will move into a train.py file.  

**Tasks:**  
1. Load Cleaned Data
2. Model Data
3. Evaluate Model

**Note For Part 2**  
The two model types that will be compared and evaluated for use in the final training script are XGBoost and Random Forest. This classification problem involves nonlinear interactions and many contextual dependencies, so XGBoost is likely to perform better: boosting fits each new tree to the errors of the trees before it, which can help it pick up rare but important situations (for example, Hail Mary attempts when losing in the last minute of a game).

**1. Load Cleaned Data**

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier

In [4]:
df = pd.read_csv("C:/Users/Tyler/OneDrive/Coding/Repositories/AI_Offensive_Coordinator-/python/data/cleaned_pbp_22_to_24.csv")

In [5]:
df.head()

| Unnamed: 0 | season | week | game_id | posteam | defteam | qtr | down | ydstogo | yardline_100 | score_differential | def_pass_rate_last3 | avg_yds_pp_alwd_last3 | avg_rush_pp_alwd_last3 | play_subtype |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | 1 | 2022_01_BAL_NYJ | NYJ | BAL | 1.0 | 1.0 | 10.0 | 78.0 | 0.0 | 0.65625 | 5.358173 | 4.320055 | outside run |
| 1 | 2022 | 1 | 2022_01_BAL_NYJ | NYJ | BAL | 1.0 | 1.0 | 10.0 | 59.0 | 0.0 | 0.65625 | 5.358173 | 4.320055 | short pass left |
| 2 | 2022 | 1 | 2022_01_BAL_NYJ | NYJ | BAL | 1.0 | 2.0 | 10.0 | 59.0 | 0.0 | 0.65625 | 5.358173 | 4.320055 | outside run |
| 3 | 2022 | 1 | 2022_01_BAL_NYJ | NYJ | BAL | 1.0 | 3.0 | 5.0 | 54.0 | 0.0 | 0.65625 | 5.358173 | 4.320055 | short pass right |
| 4 | 2022 | 1 | 2022_01_BAL_NYJ | BAL | NYJ | 1.0 | 1.0 | 10.0 | 72.0 | 0.0 | 0.488917 | 5.883399 | 3.73262 | short pass right |


**2. Model Data**

Random Forest will be the baseline model that will be compared to XGBoost.

In [8]:
df['play_subtype'].unique()

array(['outside run', 'short pass left', 'short pass right', 'inside run',
       'deep pass right', 'short pass middle', 'deep pass left',
       'deep pass middle'], dtype=object)

In [9]:
df['play_subtype'].value_counts()

play_subtype
outside run          20669
short pass right     12589
short pass left      11269
inside run            6977
short pass middle     6829
deep pass right       2731
deep pass left        2630
deep pass middle      1359
Name: count, dtype: int64
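
The classes are clearly imbalanced: "outside run" makes up roughly 32% of plays, while "deep pass middle" is about 2%. As a rough reference point for later evaluation, the sketch below computes the class shares and the accuracy of a model that always predicts the most common class. The counts are copied from the output above; the variable names are illustrative only.

```python
# Class counts copied from the value_counts() output above.
counts = {
    "outside run": 20669,
    "short pass right": 12589,
    "short pass left": 11269,
    "inside run": 6977,
    "short pass middle": 6829,
    "deep pass right": 2731,
    "deep pass left": 2630,
    "deep pass middle": 1359,
}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
# Accuracy of a model that always predicts the most frequent class.
baseline_accuracy = counts[majority_class] / total

for label, n in counts.items():
    print(f"{label:<18} {n / total:6.1%}")
print(f"majority-class baseline ({majority_class}): {baseline_accuracy:.1%}")
```

Any trained model should beat this baseline on accuracy to be considered useful.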

In [18]:
# defining the feature (X) and target (y) components of the dataset
X = df[['season', 'week', 'defteam', 'qtr', 'down', 'ydstogo', 'yardline_100', 'score_differential', 'def_pass_rate_last3', 'avg_yds_pp_alwd_last3', 'avg_rush_pp_alwd_last3']]
y = df['play_subtype']
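
Note that `defteam` appears as a string column in `df.head()` (for example "BAL"), and scikit-learn's `RandomForestClassifier` expects numeric features. If the column is not encoded elsewhere, one option is one-hot encoding with `pd.get_dummies`. The sketch below uses a small toy frame, not the real data, and the names `X_demo` and `X_encoded` are hypothetical.

```python
import pandas as pd

# Toy frame with two of the notebook's feature columns (values are illustrative).
X_demo = pd.DataFrame({
    "defteam": ["BAL", "NYJ", "BAL", "KC"],
    "down": [1.0, 2.0, 3.0, 1.0],
})

# One-hot encode the team column; numeric columns pass through unchanged.
X_encoded = pd.get_dummies(X_demo, columns=["defteam"], dtype=int)
print(X_encoded.columns.tolist())
```

Whichever encoding is chosen would need to be reproduced when the preprocessing is ported to C++, so a simple, explicit mapping may be preferable.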

In [20]:
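# getting train/validation/test splits (60/20/20)
# first hold out 20% of the data as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# then take 25% of the remaining 80% (20% of the total) as the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)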
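
The two-step split above yields roughly 60% train, 20% validation, and 20% test. Because the classes are imbalanced, one variation worth considering is passing `stratify=` so each split keeps the same class proportions. This is a sketch on toy labels, not a change the notebook has made; all names here are illustrative.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption): 80 common plays, 20 rare ones.
y_demo = ["outside run"] * 80 + ["deep pass middle"] * 20
X_demo = [[i] for i in range(len(y_demo))]

# Step 1: hold out 20% as the test set, preserving class proportions.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
# Step 2: 25% of the remaining 80% becomes the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42
)

for name, labels in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(name, len(labels), dict(Counter(labels)))
```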

In [22]:
# encoding the target variable (fit on the training labels only, then reuse for val/test)
encoder = LabelEncoder()

y_train = encoder.fit_transform(y_train)
y_val = encoder.transform(y_val)
y_test = encoder.transform(y_test)
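
For reference, `LabelEncoder` assigns integer codes in sorted order of the class names, and `inverse_transform` maps predictions back to play names. The sketch below shows the round trip on a few sample labels; `encoder_demo` is a separate, hypothetical encoder so the notebook's `encoder` is left untouched.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["outside run", "inside run", "deep pass left", "inside run"]

encoder_demo = LabelEncoder()
encoded = encoder_demo.fit_transform(labels)

# classes_ is sorted alphabetically; each label's position is its integer code.
print(list(encoder_demo.classes_))
print(list(encoded))

# Map integer codes (for example, model predictions) back to play names.
decoded = encoder_demo.inverse_transform(encoded)
print(list(decoded))
```

The `classes_` mapping is what any downstream consumer, such as a C++ port, would need in order to turn predicted integers back into play subtypes.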
