### Data Preprocessing and Feature Selection

This part covers data preprocessing and feature selection. Important features are selected with L1-regularized logistic regression (Logistic Regression CV).
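
To make the idea concrete, here is a minimal sketch of L1-based feature selection with scikit-learn's `LogisticRegressionCV`. It runs on synthetic data, not the tournament data, and it is not the notebook's `select_top_features`, whose internals are not shown here. The column names `f0`..`f19`, the `1e-5` cutoff, and the choice of `liblinear` are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption: 20 candidate features).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]  # hypothetical names

# L1 penalties are scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# The regularization strength C is picked by 5-fold cross-validation on log-loss.
model = LogisticRegressionCV(
    Cs=10,
    cv=5,
    penalty="l1",
    solver="liblinear",
    scoring="neg_log_loss",
    max_iter=1000,
    random_state=0,
)
model.fit(X_scaled, y)

# The L1 penalty drives some coefficients to exactly zero; keep the rest.
coef = model.coef_.ravel()
selected_idx = np.flatnonzero(np.abs(coef) > 1e-5)
selected = [feature_names[i] for i in selected_idx]
print(selected)
```

Features whose coefficient survives the penalty are kept; how many survive depends on the cross-validated `C`.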

For preprocessing, we shuffle the data and engineer new features. The functions are in the Processors package in the same directory.

In [1]:
import pandas as pd
import numpy as np

from Processors.missing_value_processor import ratio
from Processors.feature_engieering_processor import feature_engineering
from Processors.shuffle_processor import shuffle
from Processors.get_feature_names_processor import get_feature_names
from Processors.diff_processor import diff_processor
from Statistical_Analysis.ExtraTree_Classifier import select_top_features


In [2]:
# Read the Data
train = pd.read_csv("train_2003_2022.csv").query("season!=2018").reset_index(drop=True)
test = pd.read_csv("train_2003_2022.csv").query("season==2018").reset_index(drop=True)
# Shuffle the data, since Team_1 wins every game in the dataset
train = shuffle(train, 600)
test = shuffle(test, 32)
# Constructing new features
train = feature_engineering(train)
test = feature_engineering(test)
# Decrease missing values
train = ratio(train)
test = ratio(test)
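
Because Team_1 wins every game before shuffling, the target would otherwise be constant. One way such a shuffle could work is sketched below. This is an assumption, not the code in `Processors.shuffle_processor`: the helper `swap_team_order`, its `n_swap` and `seed` parameters, and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def swap_team_order(df, n_swap, seed=0):
    """Swap team1_*/team2_* columns on n_swap random rows and flip the label."""
    out = df.copy()
    rng = np.random.default_rng(seed)
    rows = rng.choice(out.index.to_numpy(), size=n_swap, replace=False)
    team1_cols = [c for c in out.columns
                  if c.startswith("team1_") and c != "team1_win"]
    for c1 in team1_cols:
        c2 = "team2_" + c1[len("team1_"):]
        if c2 in out.columns:
            # Exchange the paired columns on the chosen rows only.
            tmp = out.loc[rows, c1].to_numpy()
            out.loc[rows, c1] = out.loc[rows, c2].to_numpy()
            out.loc[rows, c2] = tmp
    # After the swap, the former winner sits in the team2 slot.
    out.loc[rows, "team1_win"] = 1 - out.loc[rows, "team1_win"]
    return out

# Toy data: the winner is always listed first, as in the raw file.
games = pd.DataFrame({
    "team1_seed": [1, 2, 3, 4],
    "team2_seed": [16, 15, 14, 13],
    "team1_win": [1, 1, 1, 1],
})
mixed = swap_team_order(games, n_swap=2, seed=0)
print(mixed)
```

After the swap, half of the toy rows have `team1_win == 0`, so a classifier can no longer score well by always predicting a Team_1 win.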

In [3]:
train = diff_processor(train)
test = diff_processor(test)
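
Several selected features further down end in `_diff` (for example `seed_diff` and `adjde_diff`). A plausible reading is that `diff_processor` derives them from paired team columns; the sketch below illustrates that idea only. The helper `add_diff_columns`, the team1-minus-team2 sign convention, and the toy values are assumptions, not the package's actual code.

```python
import pandas as pd

def add_diff_columns(df, stats):
    """Add <stat>_diff = team1_<stat> - team2_<stat> for each stat name."""
    out = df.copy()
    for stat in stats:
        out[f"{stat}_diff"] = out[f"team1_{stat}"] - out[f"team2_{stat}"]
    return out

# Toy frame with two paired statistics (values are made up).
toy = pd.DataFrame({
    "team1_seed": [1, 8],
    "team2_seed": [16, 9],
    "team1_adjde": [90.0, 95.5],
    "team2_adjde": [101.0, 94.0],
})
with_diffs = add_diff_columns(toy, ["seed", "adjde"])
print(with_diffs[["seed_diff", "adjde_diff"]])
```

Difference features let a model compare the two teams directly instead of learning the comparison from two separate columns.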

In [4]:
y_train = train['team1_win']
X_train = train.drop(columns=['team1_win'])

y_test = test['team1_win']
X_test = test.drop(columns=['team1_win'])

In [5]:
features = select_top_features(X_train,y_train)
features.append('team1_keyplayers')
features.append('team2_keyplayers')

In [6]:
print(features)

['team1_ap_final', 'team2_ap_final', 'team1_adjoe', 'adjde_diff', 'team1_seed', 'exp_win1', 'seed_diff', 'team1_keyplayers', 'team2_keyplayers']


### Baseline

In this part, we train several machine learning models on the historical data.

Note: The competition uses log-loss as its metric. To tune the hyperparameters, see the Baseline package.
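
Since log-loss drives everything that follows, a small worked example may help. For binary labels it is the mean of `-[y*log(p) + (1-y)*log(1-p)]` over all games. The helper `binary_log_loss` and the toy predictions below are illustrative only and are not part of the Baseline package.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted P(y=1)."""
    y = np.asarray(y_true, dtype=float)
    # Clip so that log(0) never occurs for predictions of exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = [1, 0, 1, 1]
coin_flip = binary_log_loss(y, [0.5, 0.5, 0.5, 0.5])        # ln 2, about 0.693
confident_right = binary_log_loss(y, [0.9, 0.1, 0.9, 0.9])  # about 0.105
confident_wrong = binary_log_loss(y, [0.1, 0.9, 0.1, 0.1])  # about 2.303
print(coin_flip, confident_right, confident_wrong)
```

Always predicting 0.5 scores ln 2 (about 0.693), so that is the number a model has to beat. Confident wrong predictions are penalized far more heavily than confident right ones are rewarded.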

In [7]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from Baseline.lightGBM import lightGBM
from Baseline.catboost import catboost
from Baseline.randomforest import random_forest

In [8]:
# Redefine X for baseline training since we have performed feature selection
# features = ['team1_seed','seed_diff','adjoe_diff','adjde_diff','stlrate_diff',
           # 'pt_team_season_wins_diff','oppftpct_diff','arate_diff','pt_overall_ncaa_diff','pt_team_season_losses_diff','oppstlrate_diff','oe_diff','pt_school_s16_diff','exp_win2','exp_win1','team2_long','team2_lat','num_ot']

X_train = X_train[features]

X_test = X_test[features]
# y does not change

In [9]:
LGBM = lightGBM(X_train, y_train)

In [10]:
CAT = catboost(X_train, y_train)

0:	learn: 0.6910703	total: 146ms	remaining: 4.24s
1:	learn: 0.6892772	total: 147ms	remaining: 2.06s
2:	learn: 0.6870679	total: 147ms	remaining: 1.33s
3:	learn: 0.6852511	total: 148ms	remaining: 962ms
4:	learn: 0.6832132	total: 149ms	remaining: 743ms
5:	learn: 0.6814082	total: 149ms	remaining: 597ms
6:	learn: 0.6796287	total: 150ms	remaining: 492ms
7:	learn: 0.6779312	total: 150ms	remaining: 414ms
8:	learn: 0.6762601	total: 151ms	remaining: 352ms
9:	learn: 0.6744784	total: 152ms	remaining: 303ms
10:	learn: 0.6726796	total: 152ms	remaining: 263ms
11:	learn: 0.6707896	total: 153ms	remaining: 229ms
12:	learn: 0.6693417	total: 153ms	remaining: 200ms
13:	learn: 0.6677094	total: 154ms	remaining: 175ms
14:	learn: 0.6660789	total: 154ms	remaining: 154ms
15:	learn: 0.6643220	total: 154ms	remaining: 135ms
16:	learn: 0.6628891	total: 155ms	remaining: 118ms
17:	learn: 0.6614015	total: 155ms	remaining: 104ms
18:	learn: 0.6599753	total: 156ms	remaining: 90.2ms
19:	learn: 0.6584217	total: 156ms	remain

In [11]:
Random_Forest = random_forest(X_train, y_train)

### Performance

In the following, we evaluate each baseline on log-loss, ROC AUC, and accuracy.

Each baseline is scored on the held-out 2018 season (`X_test`, `y_test`) prepared above.
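
The three metrics reported for each model could also be gathered in one helper, sketched below on toy values. The function `evaluate` and its 0.5 threshold are assumptions for illustration. The cells that follow call each model's own `predict` for accuracy, which need not match a 0.5 cutoff exactly.

```python
import numpy as np
from sklearn import metrics

def evaluate(name, y_true, proba, threshold=0.5):
    """Return log-loss, ROC AUC and accuracy for predicted P(team1_win = 1)."""
    proba = np.asarray(proba, dtype=float)
    # Hard labels come from thresholding the probabilities (assumed cutoff).
    labels = (proba >= threshold).astype(int)
    return {
        "model": name,
        "log_loss": metrics.log_loss(y_true, proba),
        "roc_auc": metrics.roc_auc_score(y_true, proba),
        "accuracy": metrics.accuracy_score(y_true, labels),
    }

# Toy labels and probabilities, not the tournament results.
y_true = np.array([1, 0, 1, 0])
proba = np.array([0.8, 0.3, 0.6, 0.4])
report = evaluate("toy", y_true, proba)
print(report)
```

With a fitted model, the same call would be `evaluate("LightGBM", y_test, LGBM.predict_proba(X_test)[:, 1])`.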


In [12]:
LGBM_pred = LGBM.predict_proba(X_test)[:,1]

In [13]:
print("log_loss for LightGBM is: ", metrics.log_loss(y_test,LGBM_pred))
print("roc_auc for LightGBM is: ", metrics.roc_auc_score(y_test,LGBM_pred))
print("Accuracy for LightGBM is: ", metrics.accuracy_score(y_test, LGBM.predict(X_test)))

log_loss for LightGBM is:  0.6378990650271701
roc_auc for LightGBM is:  0.7375
Accuracy for LightGBM is:  0.6865671641791045


In [14]:
CAT_pred = CAT.predict_proba(X_test)[:,1]

In [15]:
print("log_loss for CatBoost is: ", metrics.log_loss(y_test,CAT_pred))
print("roc_auc for CatBoost is: ", metrics.roc_auc_score(y_test,CAT_pred))
print("Accuracy for CatBoost is: ", metrics.accuracy_score(y_test, CAT.predict(X_test)))

log_loss for CatBoost is:  0.6243125230346609
roc_auc for CatBoost is:  0.7294642857142857
Accuracy for CatBoost is:  0.6865671641791045


In [16]:
Random_Forest_pred = Random_Forest.predict_proba(X_test)[:,1]

In [17]:
print("log_loss for Random Forest is: ", metrics.log_loss(y_test,Random_Forest_pred))
print("roc_auc for Random Forest is: ", metrics.roc_auc_score(y_test,Random_Forest_pred))
print("Accuracy for Random Forest is: ", metrics.accuracy_score(y_test, Random_Forest.predict(X_test)))

log_loss for Random Forest is:  0.6276828396866733
roc_auc for Random Forest is:  0.7258928571428571
Accuracy for Random Forest is:  0.6865671641791045


### Ensemble Model