Checklist

- try 2-3 different modelling methods with appropriate evaluation/metrics
  - understand when to apply which types of models
    - discuss in notebook: regression task . . . supervised
  - identify one of the tested models as the best
  
*detailed*
 - "try number of different methods" --> repeat RandomizedSearchCV (RSCV) with various solvers; can do two rounds in this notebook
 - review model outcomes and iterate if needed
 - identify "best" model, note that it is not the most performant one (can include GradientBoosting to show the tradeoff)
 - review target feature, results.

**Load Packages, Data**

In [1]:
# data, numbers
import pandas as pd
import numpy as np
from scipy import stats
# plotting
import seaborn as sns
import matplotlib.pyplot as plt
# utilities
from time import time
from tqdm import tqdm
from pathlib import Path

# Modeling and related from Scikit-Learn and XGBoost
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error
# specific models
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import GradientBoostingRegressor
import xgboost as xgb

In [2]:
data = pd.read_parquet(Path('data/game_stats.parquet'))
data.shape

(1521, 20)

**Split Data, Scale**
  - train/test and validate
    - use train/test to tune models
    - use validate to test models
    
Increase size of validation set? Adjust CV folds for training/testing sets?
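
One way to get the separate train/test and validate sets described above is to call `train_test_split` twice. The sketch below is only an illustration on synthetic data: the frame `X_demo`, the target `y_demo`, the 20%/25% fractions, and `random_state=42` are all assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the notebook's data
# (column names are hypothetical).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(1521, 12)),
                      columns=[f"f{i}" for i in range(12)])
y_demo = pd.Series(rng.normal(size=1521), name="home_margin")

# Step 1: set aside the validation set; it is not touched during tuning.
X_dev, X_val, y_dev, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Step 2: split the remainder into train and test sets used for tuning.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_dev, y_dev, test_size=0.25, random_state=42)

print(X_tr.shape, X_te.shape, X_val.shape)
```

Raising the first `test_size` enlarges the validation set at the cost of data available for tuning. Fixing `random_state` keeps the split identical between notebook runs.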

In [None]:
X = data.iloc[:,6:-2]
y1 = data['home_margin'] # target for regression
y2 = data['home_win'] # target for classification

# Initial split: this test set is held out for final scoring
X_train, X_test, y1_train, y1_test = train_test_split(X, y1, test_size=0.2)
# Align the classification target with the same rows, by index label
y2_train = y2.loc[y1_train.index]
y2_test = y2.loc[y1_test.index]
features = X_train.columns
print(X_train.shape)
print(X_test.shape)

In [None]:
# Scale data, fit scaler to training data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
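
If cross-validation is later run on the already-scaled `X_train`, each fold's held-out part has contributed to the scaler's mean and variance. The effect is often small, but a possible alternative is to put the scaler inside a pipeline so it is refit on each training fold. The following is a minimal sketch on synthetic data; `X_demo`, `y_demo`, the 5 folds, and the MAE scoring are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, unscaled features and a target that depends on two of them.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(300, 12))
y_demo = X_demo[:, 0] - 0.5 * X_demo[:, 1] + rng.normal(size=300)

# The scaler is a pipeline step, so it is fit only on each training fold.
scaled_ridge = make_pipeline(StandardScaler(), Ridge())

# scikit-learn reports MAE as a negative score (higher is better).
scores = cross_val_score(scaled_ridge, X_demo, y_demo,
                         cv=5, scoring="neg_mean_absolute_error")
print(-scores.mean())
```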

**Parameter Tuning**

In [None]:
def_models = {
    'Ridge': Ridge(), # 2s
    'xgb_linear': xgb.XGBRegressor(booster="gblinear"), # 5s
    'xgb_tree': xgb.XGBRegressor(booster="gbtree"), # 15s
    'gradient_boosting': GradientBoostingRegressor(), # 10 min
}

parameters = {
    # defaults: alpha=1 [0,inf) | tol=1e-4
    'Ridge': {
        'rscv_iterations': 150,
        'alpha': stats.loguniform(0.001, 5050, 0, 1),
        'tol': stats.loguniform(1e-4, 1e0, 0, 1),
        'solver': ['saga'],
    },

    # eta/learning_rate, gamma/min_split_loss, max_depth
    # defaults: eta=0.3, gamma=0, max_depth=6
    'xgb_tree': {
        'rscv_iterations': 45,
        'eta': stats.beta(3, 6, 0, 1),
        'gamma': stats.uniform(0, 3),
        'max_depth': stats.binom(4, 0.5, 2),  # integers from 2 to 6
    },
}
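
The `rscv_iterations` entry is not an estimator parameter, so it has to be separated from the sampling distributions before a search is built. The sketch below shows one way this could be done with `RandomizedSearchCV`. It is an assumption about how such a dictionary might be consumed, not this notebook's actual tuning loop; `demo_params`, the synthetic data, the reduced iteration count, and the scoring choice are all hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the scaled training set.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 12))
y_demo = X_demo @ rng.normal(size=12) + rng.normal(size=200)

# Same layout as a 'Ridge' entry, with fewer iterations to keep it quick.
demo_params = {
    "rscv_iterations": 10,
    "alpha": stats.loguniform(0.001, 5050),
    "tol": stats.loguniform(1e-4, 1e0),
    "solver": ["saga"],
}

# Work on a copy so the original dictionary keeps its iteration count.
param_distributions = dict(demo_params)
n_iter = param_distributions.pop("rscv_iterations")

search = RandomizedSearchCV(
    Ridge(),
    param_distributions=param_distributions,
    n_iter=n_iter,
    scoring="neg_mean_absolute_error",
    cv=5,
    random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, -search.best_score_)
```

Lists such as `['saga']` are sampled uniformly, while the frozen `scipy.stats` distributions are sampled through their `rvs` method, so both kinds of entry can sit in the same dictionary.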

In [None]:
# use pipelines

ks = list(range(1, 13))  # candidate k values: 1 through 12
kb_pipe = make_pipeline(SelectKBest(score_func=f_regression, k=12), LinearRegression())
grid_params = {
    'selectkbest__k': ks,
    'linearregression__positive': [False, True]
}