### Data Preprocessing and Feature Selection

This section preprocesses the data and selects features. We use L1 regularization with Logistic Regression CV (cross-validated logistic regression) to select the important features.
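As a rough illustration of L1-based selection, the sketch below fits scikit-learn's `LogisticRegressionCV` with an L1 penalty and keeps the features whose coefficients are non-zero. It is not the project's own code: the notebook itself calls `select_top_features` from `Statistical_Analysis.Chi_Squared_Processor`, whose internals are not shown here. The helper name `l1_selected_features`, the synthetic data, and the settings (`Cs=10`, `cv=5`, standardization) are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def l1_selected_features(X, y, feature_names, seed=0):
    """Hypothetical helper: names of features with a non-zero L1 coefficient."""
    model = make_pipeline(
        StandardScaler(),  # put features on one scale so the penalty treats them alike
        LogisticRegressionCV(
            Cs=10,                   # grid of 10 inverse regularization strengths
            cv=5,                    # 5-fold cross-validation picks the best C
            penalty="l1",            # L1 drives weak coefficients to exactly zero
            solver="liblinear",      # a solver that supports the L1 penalty
            scoring="neg_log_loss",  # select C by log-loss
            random_state=seed,
        ),
    )
    model.fit(X, y)
    coefs = model[-1].coef_.ravel()
    return [name for name, c in zip(feature_names, coefs) if c != 0.0]


# Synthetic stand-in data (assumption): 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
demo_names = [f"f{i}" for i in range(8)]
selected = l1_selected_features(X_demo, y_demo, demo_names)
print(selected)
```

Stronger regularization (smaller `C`) zeroes out more coefficients, so the size of the selected set depends on which `C` cross-validation prefers.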

For preprocessing, we shuffle the data and engineer new features. The helper functions live in the Processors package in the same directory.

In [1]:
import pandas as pd
import numpy as np

from Processors.missing_value_processor import ratio
from Processors.feature_engieering_processor import feature_engineering
from Processors.shuffle_processor import shuffle
from Processors.get_feature_names_processor import get_feature_names
from Processors.diff_processor import diff_processor
from Statistical_Analysis.Chi_Squared_Processor import select_top_features


In [2]:
# Read the Data
df = pd.read_csv("train_2003_2022.csv")

# Shuffle the data, since Team_1 wins every game in the dataset
df = shuffle(df, 600)

# Construct new features
df = feature_engineering(df)

# Decrease missing values
df = ratio(df)
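The implementation of `shuffle` is not shown in this notebook. One plausible way to remove the "Team_1 always wins" bias is to swap the `team1_*` and `team2_*` columns on a random half of the rows and set the label accordingly. The sketch below does that on a toy frame; the function name `swap_teams`, the use of the second argument as a random seed, and the column-prefix convention are assumptions, not the project's actual code.

```python
import numpy as np
import pandas as pd


def swap_teams(df, seed=600, label="team1_win"):
    """Hypothetical sketch: swap team1_*/team2_* columns on about half the rows."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    flip = rng.random(len(out)) < 0.5  # boolean mask of rows to swap
    team1_cols = [c for c in out.columns if c.startswith("team1_") and c != label]
    for c1 in team1_cols:
        c2 = "team2_" + c1[len("team1_"):]
        if c2 in out.columns:
            # Exchange the paired columns on the flipped rows only.
            out.loc[flip, [c1, c2]] = out.loc[flip, [c2, c1]].to_numpy()
    # Assumes the original team1 won every game: flipped rows become losses.
    out[label] = np.where(flip, 0, 1)
    return out


# Toy data (assumption): the original team1 always has the better seed.
demo = pd.DataFrame({"team1_seed": [1, 2, 3, 4], "team2_seed": [16, 15, 14, 13]})
swapped = swap_teams(demo, seed=0)
print(swapped)
```

After the swap, the label is no longer constant, so a classifier has both classes to learn from.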

In [3]:
df = diff_processor(df)

In [4]:
y = df['team1_win']
X = df.drop(columns=['team1_win'])

numeric = X.select_dtypes(include=['float', 'int64', 'int32', 'int']).columns.tolist()

categorical = X.drop(columns = numeric).columns.tolist()

In [5]:
features = select_top_features(X,y)
features.append('team1_keyplayers')
features.append('team2_keyplayers')

In [6]:
print(features)

['team2_seed', 'team1_ap_final', 'pt_overall_ncaa_diff', 'adjde_diff', 'adjoe_diff', 'pt_school_s16_diff', 'pt_team_season_wins_diff', 'team1_seed', 'team1_keyplayers', 'team2_keyplayers']


### Baseline

In this part, we train several machine learning models on the historical data.

Note: The competition uses log-loss as its metric. To tune the hyperparameters, see the Baseline package.
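For reference, binary log-loss is the average negative log-probability assigned to the true outcome. The short example below computes it by hand and checks the result against `sklearn.metrics.log_loss`; the labels and probabilities are made-up illustration values, not data from this project.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up example values (assumption): true outcomes and predicted P(team1 wins).
y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.4])

# Manual formula: -mean( y*log(p) + (1-y)*log(1-p) )
manual = -np.mean(y_true * np.log(p_pred) + (1 - y_true) * np.log(1 - p_pred))
library = log_loss(y_true, p_pred)

# A model that always predicts 0.5 scores ln(2), a useful reference point.
coin_flip = log_loss(y_true, np.full(len(y_true), 0.5))

print(manual, library, coin_flip)
```

Lower is better, and confident wrong predictions are penalized heavily, so a model should beat the constant-0.5 score of ln 2 (about 0.693) to be useful.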

In [7]:
from sklearn.model_selection import train_test_split
from sklearn import metrics
from Baseline.lightGBM import lightGBM
from Baseline.catboost import catboost
from Baseline.randomforest import random_forest

In [8]:
# Redefine X for baseline training, since we have performed feature selection
# features = ['team1_seed','seed_diff','adjoe_diff','adjde_diff','stlrate_diff',
           # 'pt_team_season_wins_diff','oppftpct_diff','arate_diff','pt_overall_ncaa_diff','pt_team_season_losses_diff','oppstlrate_diff','oe_diff','pt_school_s16_diff','exp_win2','exp_win1','team2_long','team2_lat','num_ot']

X = df[features]
# y does not change

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1)

In [10]:
LGBM = lightGBM(X_train, y_train)

In [11]:
CAT = catboost(X_train, y_train)

0:	learn: 0.6910223	total: 135ms	remaining: 3.93s
1:	learn: 0.6890797	total: 136ms	remaining: 1.91s
2:	learn: 0.6870661	total: 137ms	remaining: 1.23s
3:	learn: 0.6850406	total: 137ms	remaining: 893ms
4:	learn: 0.6832331	total: 138ms	remaining: 690ms
5:	learn: 0.6811811	total: 139ms	remaining: 554ms
6:	learn: 0.6793529	total: 139ms	remaining: 457ms
7:	learn: 0.6770607	total: 140ms	remaining: 384ms
8:	learn: 0.6747923	total: 140ms	remaining: 327ms
9:	learn: 0.6727963	total: 141ms	remaining: 282ms
10:	learn: 0.6706020	total: 141ms	remaining: 244ms
11:	learn: 0.6687008	total: 142ms	remaining: 213ms
12:	learn: 0.6667237	total: 142ms	remaining: 186ms
13:	learn: 0.6651890	total: 143ms	remaining: 163ms
14:	learn: 0.6633206	total: 143ms	remaining: 143ms
15:	learn: 0.6614709	total: 144ms	remaining: 126ms
16:	learn: 0.6596379	total: 144ms	remaining: 110ms
17:	learn: 0.6579588	total: 145ms	remaining: 96.6ms
18:	learn: 0.6564302	total: 145ms	remaining: 84.1ms
19:	learn: 0.6546867	total: 146ms	remai

In [12]:
Random_Forest = random_forest(X_train, y_train)

### Performance

Below, we evaluate each baseline on log-loss, ROC AUC, and accuracy.

Each baseline is scored on the held-out test set produced by train_test_split (20% of the data).
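The same three metrics are computed for every model, so the evaluation can be wrapped in one small helper. The sketch below is a possible refactoring, not code from the project: the name `evaluate`, the synthetic data, and the stand-in `RandomForestClassifier` are assumptions. It works with any fitted classifier that exposes `predict` and `predict_proba`.

```python
from sklearn import metrics
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def evaluate(name, model, X_test, y_test):
    """Hypothetical helper: print and return log-loss, ROC AUC, and accuracy."""
    proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class
    scores = {
        "log_loss": metrics.log_loss(y_test, proba),
        "roc_auc": metrics.roc_auc_score(y_test, proba),
        "accuracy": metrics.accuracy_score(y_test, model.predict(X_test)),
    }
    for metric_name, value in scores.items():
        print(f"{metric_name} for {name} is: {value:.4f}")
    return scores


# Synthetic stand-in data and model (assumption), split like the notebook's 80/20.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)
demo_model = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)
demo_scores = evaluate("demo forest", demo_model, X_te, y_te)
```

Log-loss and ROC AUC are computed from the predicted probabilities, while accuracy uses the hard class labels from `predict`.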


In [13]:
LGBM_pred = LGBM.predict_proba(X_test)[:,1]

In [14]:
print("log_loss for LightGBM is: ", metrics.log_loss(y_test,LGBM_pred))
print("roc_auc for LightGBM is: ", metrics.roc_auc_score(y_test,LGBM_pred))
print("Accuracy for LightGBM is: ", metrics.accuracy_score(y_test, LGBM.predict(X_test)))

log_loss for LightGBM is:  0.5693992920753588
roc_auc for LightGBM is:  0.7694129697202483
Accuracy for LightGBM is:  0.672


In [15]:
CAT_pred = CAT.predict_proba(X_test)[:,1]

In [16]:
print("log_loss is: ", metrics.log_loss(y_test,CAT_pred))
print("roc_auc is: ", metrics.roc_auc_score(y_test,CAT_pred))
print("Accuracy for CAT is: ",metrics.accuracy_score(y_test, CAT.predict(X_test)))

log_loss is:  0.5663441725588816
roc_auc is:  0.7717815760834774
Accuracy for CAT is:  0.724


In [17]:
Random_Forest_pred = Random_Forest.predict_proba(X_test)[:,1]

In [18]:
print("log_loss for Random Forest is: ", metrics.log_loss(y_test,Random_Forest_pred))
print("roc_auc for Random Forest is: ", metrics.roc_auc_score(y_test,Random_Forest_pred))
print("Accuracy for Random Forest is: ", metrics.accuracy_score(y_test, Random_Forest.predict(X_test)))

log_loss for Random Forest is:  0.572934674570584
roc_auc for Random Forest is:  0.7648037897701812
Accuracy for Random Forest is:  0.7


### Ensemble Model