# Regression model

First, let's try a regression model. We can use a few weeks of stats for a given player as input and try to predict the next week's stats as the output. To estimate model performance, we can make predictions for every player in every season, and even break each season into several batches.

In [1]:
# Standard library imports
from typing import Tuple

# PyPI imports
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import root_mean_squared_error

# Input data
data_file='../data/parsed_qb_data.parquet'

## 1. Data loading

In [2]:
data_df=pd.read_parquet(data_file)
data_df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 17938 entries, (np.int64(1996), 'Brett Favre', np.int64(1)) to (np.int64(2024), 'Drake Maye', np.int64(18))
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Rank    17938 non-null  object
 1   Exp     17938 non-null  object
 2   G       17938 non-null  object
 3   Cmp     17938 non-null  object
 4   Att     17938 non-null  object
 5   Cm%     17938 non-null  object
 6   PYd     17938 non-null  object
 7   Y/Att   17938 non-null  object
 8   PTD     17938 non-null  object
 9   Int     17938 non-null  object
 10  Rsh     17938 non-null  object
 11  RshYd   17938 non-null  object
 12  RshTD   17938 non-null  object
 13  FP/G    17938 non-null  object
 14  FantPt  17938 non-null  object
dtypes: object(15)
memory usage: 2.1+ MB


## 2. Data cleaning

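Every column was loaded with the object dtype, so next we need to drop the rows with missing values and convert the features to float. Standardization happens later, after the train/test split.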

In [3]:
# Replace empty strings with a missing-value marker
data_df.replace('', pd.NA, inplace=True)

# Drop rows containing missing values
data_df.dropna(inplace=True)

# Set float dtype for all features
data_df=data_df.astype(float)
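
As a quick sanity check, the same kind of cleaning can be tried on a small made-up frame. The sketch below is not the notebook's method: it swaps in `pd.to_numeric` with `errors='coerce'` as an alternative way to turn unparseable strings into missing values, and the column values are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for the parsed stats: every value is a string,
# and missing values are empty strings (an assumption for this example)
raw_df = pd.DataFrame({
    'Cmp': ['20', '', '18'],
    'Att': ['30', '25', ''],
})

# Coerce each column to a numeric dtype; anything unparseable becomes missing
numeric_df = raw_df.apply(pd.to_numeric, errors='coerce')

# Drop rows that contain any missing value
clean_df = numeric_df.dropna()

print(clean_df.dtypes)
print(clean_df)
```

Only the first row survives here, because each of the other two rows has an empty string in one column.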

## 3. Data generator function

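To test model performance, we need a function that slices each player's season into feature, label pairs: input_window consecutive weeks of stats as the features, and the week that follows as the label.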

In [4]:
def generate_data(data_df: pd.DataFrame, input_window: int) -> Tuple[np.ndarray, np.ndarray]:
    '''Takes dataframe and input window size, parses data into feature, label pairs,
    returns them as a tuple of numpy arrays'''

    # Get list of seasons
    seasons=data_df.index.get_level_values('Season').unique().tolist()

    features=[]
    labels=[]

    # Loop on seasons
    for season in seasons:

        # Extract the data for this season
        season_df=data_df.loc[(season)]
        
        # Get the list of players for this season
        players=season_df.index.get_level_values('Player').unique().tolist()

        # Loop on the players
        for player in players:

            # Extract the data for this player
            player_df=season_df.loc[(player)]

            # Indexing variable for batch
            input_start_index=0

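            # Loop on the player data. Note: the strict inequality stops one row
            # early, so a window whose label is the player's last row is skipped
            while input_start_index + input_window + 1 < len(player_df):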

                # Extract and collect the features and labels
                feature_row=player_df.iloc[input_start_index:input_start_index + input_window]
                label_row=player_df.iloc[input_start_index + input_window]
                features.append(feature_row.values.tolist())
                labels.append(label_row.values.tolist())

                # Advance past this window and its label so batches do not overlap
                input_start_index+=input_window + 1

    # Convert to numpy arrays
    features=np.array(features)
    labels=np.array(labels)

    # Squeeze out the extra dimension for window width of 1
    if input_window == 1:
        features=features.squeeze(axis=1)

    return features, labels
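
To make the windowing easier to follow, here is a minimal sketch of the same slicing applied to a plain list instead of a multi-indexed dataframe. The helper name `make_windows` and the toy values are hypothetical; only the stopping rule and step size are copied from the loop above.

```python
from typing import List, Tuple


def make_windows(rows: List[float], input_window: int) -> List[Tuple[List[float], float]]:
    '''Hypothetical helper: splits a sequence into non-overlapping
    (feature window, label) pairs.'''

    pairs = []
    start = 0

    # Same stopping rule and step size as the loop in generate_data()
    while start + input_window + 1 < len(rows):

        # The window is the features, the row right after it is the label
        window = rows[start:start + input_window]
        label = rows[start + input_window]
        pairs.append((window, label))

        # Skip past the window and its label
        start += input_window + 1

    return pairs


# Six made-up weekly values for one player
weeks = [10, 20, 30, 40, 50, 60]

print(make_windows(weeks, 1))  # [([10], 20), ([30], 40)]
print(make_windows(weeks, 2))  # [([10, 20], 30)]
```

With six weeks and a window of one, the last two weeks go unused: the strict inequality in the loop condition ends the loop before a pair whose label would be the final row.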

## 4. Training/testing data preparation

In [5]:
# Generate some feature, label pairs
input_window=1
features, labels=generate_data(data_df, input_window)

# Split them into training and testing sets
training_features, testing_features, training_labels, testing_labels=train_test_split(features, labels)

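# Scale the data. The scaler is fit on the training features only; the labels
# have the same 15 columns as the features, so the same scaler is applied to them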
scaler=StandardScaler()
scaler.fit(training_features)
training_features=scaler.transform(training_features)
training_labels=scaler.transform(training_labels)
testing_features=scaler.transform(testing_features)
testing_labels=scaler.transform(testing_labels)

print(f'Features: {training_features.shape}')
print(f'Labels: {training_labels.shape}')

Features: (5256, 15)
Labels: (5256, 15)
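
The scaling step can be illustrated with plain NumPy. This is a sketch on randomly generated data, not the notebook's data, and it computes the standardization by hand rather than calling `StandardScaler`; it uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows. The point is that the statistics come from the training rows only, and that the transform can be undone exactly.

```python
import numpy as np

# Made-up data: 100 training rows and 20 testing rows with 3 columns each
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(20, 3))

# Column statistics are computed from the training rows only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same training statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

# Undo the scaling to get back to the original units
test_restored = test_scaled * std + mean

print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
print(np.allclose(test_restored, test))
```

The testing set will generally not have exactly zero mean and unit variance after scaling, since its statistics were never used for the fit.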


## 5. Multiple linear regression model

In [6]:
# Fit the model on the training data and make predictions
# for the testing data
model=LinearRegression().fit(training_features, training_labels)
predictions=model.predict(testing_features)

# Un-scale the predictions and testing labels
testing_labels=scaler.inverse_transform(testing_labels)
predictions=scaler.inverse_transform(predictions)

# Calculate column-wise RMSE (the labels have the same columns as the features)
for i, feature in enumerate(data_df.columns):
    rmse=root_mean_squared_error(testing_labels[:,i], predictions[:,i])
    print(f'{feature} RMSE: {rmse}')
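
For reference, the per-column RMSE computed in the loop above is the square root of the mean squared difference between true and predicted values, taken separately for each column. A minimal NumPy sketch on made-up numbers:

```python
import numpy as np

# Made-up true values and predictions for two columns
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred = np.array([[1.0, 12.0], [2.0, 18.0], [3.0, 33.0]])

# Square the errors, average down the rows, then take the square root
rmse_per_column = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))

print(rmse_per_column)
```

The first column is predicted perfectly, so its RMSE is zero; the second has errors of 2, 2 and 3, giving the square root of 17/3. Because the errors are reported in each stat's own units after un-scaling, values are only comparable within a column, not across columns. In scikit-learn, passing `multioutput='raw_values'` to `root_mean_squared_error` should return the same per-column values in a single call.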
