# Overview
This is an exploratory study of the NCAA Men’s datasets. Our focus is on the odds that an underdog defeats a higher-ranked team in the NCAA tournament. Public rating data since the 2002-2003 season was provided by Kenneth Massey on the competition page; therefore, this project covers the tournaments from 2003 through 2019. The rankings right before each tournament, usually on Day 133 of the season, are the foundation for deciding which team is the underdog. After data engineering and data creation, I'll build a stacking ensemble model by following the steps in Arthur Tok's well-known notebook (https://www.kaggle.com/arthurtok/introduction-to-ensembling-stacking-in-python).

# Data Engineering

Before we dive into the modeling, feature engineering is always worth the effort if we want fruitful outcomes. We are going to examine the data sets provided by the site and form useful features by massaging the data.
1. Team Data
2. Season Info
3. Seeds Info
4. Regular Season Results: Year by Year, Game Details
5. Public Rating

In [None]:
import gc
import matplotlib.pylab as plt
plt.style.use('seaborn-dark-palette')
import numpy as np
import os
import pandas as pd
from sklearn.preprocessing import StandardScaler

DIR = '../input/google-cloud-ncaa-march-madness-2020-division-1-mens-tournament/'

The team data lists the college teams; each school is uniquely identified by a 4-digit ID number.

In [None]:
#Team Data
MTeams = pd.read_csv(f'{DIR}/MDataFiles_Stage1/MTeams.csv')
print(MTeams.shape)
print(MTeams.isnull().sum())
MTeams.head()

The season file identifies when each season began and records certain season-level properties.

In [None]:
#Season info
MSeasons = pd.read_csv(f'{DIR}/MDataFiles_Stage1/MSeasons.csv')
print(MSeasons.shape)
print(MSeasons.isnull().sum())
MSeasons.head()

The first character of Seed indicates the team's region, and the next two digits give the seed within that region. Play-in teams have a fourth character (a or b) to distinguish them, since teams that face each other in a play-in game share the same first three characters; "a" and "b" are assigned according to which Team ID is numerically lower. We therefore create the variables 'SeedConference' and 'SeedOrder' to hold the region and the seed number, respectively (despite its name, 'SeedConference' stores the tournament region, not the athletic conference).

In [None]:
#Seeds Info
#separate the seed numbers and the regions
MNCAATourneySeed = pd.read_csv(f'{DIR}/MDataFiles_Stage1/MNCAATourneySeeds.csv')
print(MNCAATourneySeed.shape)
print(MNCAATourneySeed.isnull().sum())
MNCAATourneySeeds = MNCAATourneySeed.merge(MTeams, how = 'left', left_on='TeamID', right_on='TeamID')
MNCAATourneySeeds['SeedConference'] = 'Region'+MNCAATourneySeeds['Seed'].str.slice(stop=1)
MNCAATourneySeeds['SeedOrder'] = MNCAATourneySeeds['Seed'].str.slice(start=1, stop=3).astype(int)
MNCAATourneySeeds.head(5)

In [None]:
#double check that the seeds are in the range of 1 to 16.
MNCAATourneySeeds['SeedOrder'].value_counts()
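
To make the seed format concrete, here is a minimal sketch that parses a few seed strings on their own. The sample values and the `Region` and `PlayIn` column names are made up for illustration and are not part of the competition files.

```python
import pandas as pd

# Hypothetical seed strings in the same format as MNCAATourneySeeds.csv
seeds = pd.DataFrame({'Seed': ['W01', 'X16a', 'X16b', 'Z08']})

# First character: the region letter
seeds['Region'] = seeds['Seed'].str[0]
# Characters 2-3: the seed number within the region
seeds['SeedOrder'] = seeds['Seed'].str[1:3].astype(int)
# An optional fourth character (a/b) marks a play-in team
seeds['PlayIn'] = seeds['Seed'].str.len() == 4

print(seeds)
```

Both play-in teams share the seed number 16, which is why counting `SeedOrder` values can show more than four teams per seed in a season.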

The regular season file records game-by-game results since the 1985 season. We bring the team names into the data set and calculate each team's win rate for each season.

In [None]:
#Regular Season --Team Year by Year
MRegularSeasonCompactResult = pd.read_csv(f'{DIR}/MDataFiles_Stage1/MRegularSeasonCompactResults.csv')
MRegularSeasonCompactResults = MRegularSeasonCompactResult.merge(MTeams[['TeamName','TeamID']], how='left', left_on='WTeamID', right_on='TeamID')\
                                .drop('TeamID', axis=1)\
                                .rename(columns={"TeamName":"WTeamName"})\
                                .merge(MTeams[['TeamName','TeamID']], how='left', left_on='LTeamID', right_on='TeamID')\
                                .drop('TeamID', axis=1)\
                                .rename(columns={"TeamName":"LTeamName"})
freq_win_yr = MRegularSeasonCompactResults.groupby(['Season','WTeamID','WTeamName'])['WTeamID'].count().sort_values(ascending=False)
freq_lose_yr = MRegularSeasonCompactResults.groupby(['Season','LTeamID','LTeamName'])['LTeamID'].count().sort_values(ascending=False)
MRegularSeasonTeamResultsYr = pd.concat([freq_win_yr, freq_lose_yr], axis=1)
print(MRegularSeasonTeamResultsYr.shape)
MRegularSeasonTeamResultsYr.fillna(0,inplace=True)
MRegularSeasonTeamResultsYr.index.set_names(['Season','TeamID','TeamName'],inplace=True)
MRegularSeasonTeamResultsYr.rename(columns={'WTeamID':'win','LTeamID':'loss'}, inplace=True)
MRegularSeasonTeamResultsYr['compact']=MRegularSeasonTeamResultsYr['win'] + MRegularSeasonTeamResultsYr['loss']
MRegularSeasonTeamResultsYr['WinRate'] = MRegularSeasonTeamResultsYr['win']/MRegularSeasonTeamResultsYr['compact']
MRegularSeasonTeamResultsYr.reset_index(inplace=True)
MRegularSeasonTeamResultsYr.head(10)

In [None]:
MRegularSeasonTeamResultsYr.isnull().sum().to_frame(name = 'missing').T
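
The win-rate calculation can be checked by hand on a tiny example. The sketch below uses five made-up games (the team IDs and results are hypothetical) and the same idea as above: count wins and losses separately, align them on (Season, TeamID), and fill the gaps with zero.

```python
import pandas as pd

# Five hypothetical games: each row is one game with its winner and loser
games = pd.DataFrame({
    'Season':  [2003, 2003, 2003, 2003, 2003],
    'WTeamID': [1101, 1101, 1102, 1103, 1101],
    'LTeamID': [1102, 1103, 1103, 1101, 1104],
})

# Count wins and losses per team and season
wins = games.groupby(['Season', 'WTeamID']).size().rename('win')
losses = games.groupby(['Season', 'LTeamID']).size().rename('loss')
wins.index.names = ['Season', 'TeamID']
losses.index.names = ['Season', 'TeamID']

# Align both counts; a team with no wins (or no losses) gets 0 instead of NaN
record = pd.concat([wins, losses], axis=1).fillna(0)
record['WinRate'] = record['win'] / (record['win'] + record['loss'])
record = record.reset_index()

print(record)
```

Team 1101 wins three of its four games (win rate 0.75), while team 1104 only appears as a loser and ends up with a win rate of 0 thanks to the `fillna(0)` step.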

The regular season detail file provides team-level box scores for many regular seasons of historical data, starting with the 2003 season. All games listed in the MRegularSeasonCompactResults file since the 2003 season should also be present in the MRegularSeasonDetailedResults file. As you can see, the stats are recorded with 'W' for the winning team and 'L' for the losing team. By rearranging the records so that each row describes one team in one game, we compute each team's season average for every metric. These statistics tell us about each team's offensive and defensive capability.

In [None]:
#Regular Season -- Game Details
MRegularSeasonDetail = pd.read_csv(f'{DIR}/MDataFiles_Stage1/MRegularSeasonDetailedResults.csv')
print(MRegularSeasonDetail.shape)
MRegularSeasonDetail.head()
print(MRegularSeasonDetail.columns)
wcol = [col for col in MRegularSeasonDetail if (col.startswith('W')) & (col !='WLoc') or (col=='Season')or (col=='DayNum') ]
print(wcol)
lcol = [col for col in MRegularSeasonDetail if col.startswith('L') or (col=='Season')or (col=='DayNum')]
print(lcol)

In [None]:
rename = [w[1:] for w in wcol if w.startswith('W')]

wteam = MRegularSeasonDetail[wcol].copy()
wteam.columns = ['Season','DayNum']+rename
wteam['result']='W'
wteam['LScore']=MRegularSeasonDetail['LScore']
print(len(wteam))

lteam = MRegularSeasonDetail[lcol].copy()
lteam.columns = ['Season','DayNum']+rename
lteam['result']='L'
lteam['LScore']=MRegularSeasonDetail['WScore'] # 'LScore' holds the opponent's score; for the losing team that is the winner's score
print(len(lteam))

MRegularSeasonDetails = pd.concat([wteam, lteam])
print(len(MRegularSeasonDetails))

MRegularSeasonDetails['FG_avg'] = MRegularSeasonDetails.FGM/MRegularSeasonDetails.FGA
MRegularSeasonDetails['FG3_avg'] = MRegularSeasonDetails.FGM3/MRegularSeasonDetails.FGA3
MRegularSeasonDetails['FGM2'] = MRegularSeasonDetails.FGM-MRegularSeasonDetails.FGM3
MRegularSeasonDetails['FGA2'] = MRegularSeasonDetails.FGA-MRegularSeasonDetails.FGA3
MRegularSeasonDetails['FG2_avg'] = MRegularSeasonDetails.FGM2/MRegularSeasonDetails.FGA2
MRegularSeasonDetails['FT_avg'] = MRegularSeasonDetails.FTM/MRegularSeasonDetails.FTA
MRegularSeasonDetails['TR'] = MRegularSeasonDetails.OR + MRegularSeasonDetails.DR
MRegularSeasonDetails.head(10)

In [None]:
# numeric_only=True skips the string column 'result', which cannot be averaged
MRegularSeasonTeamBox = MRegularSeasonDetails.groupby(['Season','TeamID']).mean(numeric_only=True).reset_index()

MRegularSeasonTeamBox.head()
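
The rearrangement from one row per game to one row per team per game is easier to follow on a small table. This is a sketch with two made-up games and only a few of the box-score columns; the helper `side` and the column name `OppScore` are hypothetical names chosen for the illustration.

```python
import pandas as pd

# Two hypothetical games with a reduced set of box-score columns
games = pd.DataFrame({
    'Season': [2003, 2003],
    'WTeamID': [1101, 1102], 'WScore': [70, 65], 'WFGM': [25, 22], 'WFGA': [50, 55],
    'LTeamID': [1102, 1101], 'LScore': [60, 61], 'LFGM': [20, 21], 'LFGA': [50, 60],
})

def side(df, prefix, opp_prefix):
    """Return one row per game from the point of view of the `prefix` team."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    # Strip the leading 'W' or 'L' so both sides share the same column names
    out = df[['Season'] + cols].rename(columns={c: c[1:] for c in cols})
    out['OppScore'] = df[opp_prefix + 'Score']
    out['result'] = prefix
    return out

# Stack the winners' rows on top of the losers' rows
long_df = pd.concat([side(games, 'W', 'L'), side(games, 'L', 'W')], ignore_index=True)
long_df['FG_avg'] = long_df['FGM'] / long_df['FGA']

# Season averages per team; the string column 'result' is skipped
team_box = long_df.groupby(['Season', 'TeamID']).mean(numeric_only=True).reset_index()
print(team_box)
```

Each game contributes two rows to `long_df`, so a team's average covers both its wins and its losses.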

The public rating file provides weekly team rankings from dozens of top rating systems (Pomeroy, Sagarin, RPI, ESPN, etc.) since the 2002-2003 season. For each team, the median of the latest pre-tournament rankings across the systems is taken as its final ranking.

In [None]:
#Public Rating
MMasseyOrdinals = pd.read_csv(f'{DIR}/MDataFiles_Stage1/MMasseyOrdinals.csv')
MMasseyOrdinals.sort_values(by=['Season', 'TeamID','SystemName','RankingDayNum'], inplace=True)
MMasseyOrdinals.head()

In [None]:
prior_tourney = MMasseyOrdinals.query("RankingDayNum <= 133")
comb=prior_tourney.groupby(['Season','TeamID','SystemName']).size().reset_index().rename(columns={0:'count'})
comb.shape
max_rankingdaynum = prior_tourney.groupby(['Season','TeamID','SystemName']).agg({'RankingDayNum':'max'}).reset_index()
max_rankingdaynum.head()

In [None]:
MMasseyOrdinalsPriorTourney = prior_tourney.merge(max_rankingdaynum, how='inner', left_on=['Season','TeamID','SystemName','RankingDayNum'], right_on=['Season','TeamID','SystemName','RankingDayNum'])
print(MMasseyOrdinalsPriorTourney.shape) #has to have 307393 combos
MMasseyOrdinalsPriorTourney.query('TeamID==1102 & Season==2003')

In [None]:
MMasseyOrdinalsMedian = MMasseyOrdinalsPriorTourney.groupby(['Season','TeamID'])['OrdinalRank'].median().reset_index()
print(MMasseyOrdinalsMedian.shape)
MMasseyOrdinalsMedian.head()
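
As a sanity check of the "latest rating per system, then median across systems" logic, here is a small sketch on made-up ratings for a single team. The numbers are hypothetical; the column names follow MMasseyOrdinals.csv. It uses a sort followed by `tail(1)` per group instead of the merge above, which should give the same rows.

```python
import pandas as pd

# Hypothetical ratings for one team from three systems
ratings = pd.DataFrame({
    'Season': [2003] * 6,
    'TeamID': [1101] * 6,
    'SystemName': ['POM', 'POM', 'SAG', 'SAG', 'RPI', 'RPI'],
    'RankingDayNum': [120, 133, 126, 140, 110, 128],
    'OrdinalRank': [12, 10, 15, 3, 20, 18],
})

# Keep only ratings published on or before Day 133
before = ratings[ratings['RankingDayNum'] <= 133]

# For each system, keep the most recent remaining rating
latest = (before.sort_values('RankingDayNum')
                .groupby(['Season', 'TeamID', 'SystemName'])
                .tail(1))

# The median across systems is the team's final ranking
median_rank = latest.groupby(['Season', 'TeamID'])['OrdinalRank'].median().reset_index()
print(median_rank)
```

The SAG rating from Day 140 is ignored because it was published after the cutoff, so the latest ratings are 10, 15 and 18 and the median is 15.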

# Data Creation

The goal of this section is to form the final datasets used to train our models. First, we look at the dependent variable, which is the tournament results. Next, the underdog in each game is decided by the ranking medians. After the fields are concatenated, the numeric attributes are standardized to zero mean and unit variance. Lastly, the data before 2015 is treated as the training set and 2015-2019 as the test set.

1. Tournament Results
2. Underdogs
3. Training/Test sets


In [None]:
#Tournaments
MNCAATourneyCompactResult = pd.read_csv(f'{DIR}/MDataFiles_Stage1/MNCAATourneyCompactResults.csv')
print(MNCAATourneyCompactResult.shape)
print(MNCAATourneyCompactResult.isnull().sum())
MNCAATourneyCompactResults = MNCAATourneyCompactResult.merge(MTeams[['TeamName','TeamID']], how='left', left_on='WTeamID', right_on='TeamID')\
                                .drop('TeamID', axis=1)\
                                .rename(columns={"TeamName":"WTeamName"})\
                                .merge(MTeams[['TeamName','TeamID']], how='left', left_on='LTeamID', right_on='TeamID')\
                                .drop('TeamID', axis=1)\
                                .rename(columns={"TeamName":"LTeamName"})
MNCAATourneyCompactResults['Diff_Score'] = MNCAATourneyCompactResults['WScore'] - MNCAATourneyCompactResults['LScore']
MNCAATourneyCompactResults.sort_values(by='Diff_Score', ascending=False, inplace=True)
MNCAATourneyCompactResults.head()

In [None]:
# decide the underdogs and collect the fields
MNCAATourneyCompactResults.sort_values(by=['Season','DayNum'], inplace=True)

MNCAATourney_ = MNCAATourneyCompactResults.merge(MMasseyOrdinalsMedian, how='left', left_on=['Season', 'WTeamID'], right_on=['Season', 'TeamID'])
MNCAATourney = MNCAATourney_.merge(MMasseyOrdinalsMedian, how='left', left_on=['Season', 'LTeamID'], right_on=['Season', 'TeamID'])
MNCAATourney.rename(columns={'OrdinalRank_x':'WTeamRank','OrdinalRank_y':'LTeamRank', 'TeamID_x':'T1','TeamID_y':'T2'}, inplace=True)
MNCAATourney.loc[MNCAATourney['WTeamRank'] < MNCAATourney['LTeamRank'], 'T1']=MNCAATourney['LTeamID']
MNCAATourney.loc[MNCAATourney['WTeamRank'] < MNCAATourney['LTeamRank'], 'T2']=MNCAATourney['WTeamID']
MNCAATourney['label'] = np.where(MNCAATourney['T1']==MNCAATourney['WTeamID'], 1,0)
print(MNCAATourney.shape)
MNCAATourney.tail()

In [None]:
MNCAATourney[MNCAATourney.Season>=2003].head()
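
The labeling rule can be illustrated on two made-up games. In this sketch T1 is always the underdog (the team with the larger, that is worse, median rank), T2 is the favorite, and the label is 1 when the underdog won. The team IDs and ranks are hypothetical.

```python
import numpy as np
import pandas as pd

# Two hypothetical tournament games with the teams' pre-tournament median ranks
games = pd.DataFrame({
    'WTeamID': [1101, 1203], 'LTeamID': [1102, 1204],
    'WTeamRank': [5.0, 60.0], 'LTeamRank': [40.0, 8.0],
})

# A smaller rank number means a stronger team
winner_is_favorite = games['WTeamRank'] < games['LTeamRank']

# T1 = underdog, T2 = favorite
games['T1'] = np.where(winner_is_favorite, games['LTeamID'], games['WTeamID'])
games['T2'] = np.where(winner_is_favorite, games['WTeamID'], games['LTeamID'])

# label = 1 when the underdog (T1) won the game
games['label'] = (games['T1'] == games['WTeamID']).astype(int)
print(games)
```

In the first game the favorite won, so the label is 0; in the second game the team ranked 60th beat the team ranked 8th, so the label is 1.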

In [None]:
def gen_TeamBoxDict(tag):
    TeamBoxDict = { k:tag+v for (k,v) in zip(MRegularSeasonTeamBox.columns, MRegularSeasonTeamBox.columns) if (k != 'Season' and k != 'DayNum')}  
    return(TeamBoxDict)

In [None]:
def grab_col(tag,dataframe):
    # Every lookup table is joined on (Season, team id). The helper TeamID key is dropped
    # after each merge so repeated merges do not pile up TeamID_x/TeamID_y columns.
    on = dict(how='left', left_on=['Season', tag], right_on=['Season', 'TeamID'])
    df = dataframe.merge(MMasseyOrdinalsMedian, **on)\
        .drop(columns='TeamID')\
        .rename(columns={'OrdinalRank':tag+'OrdinalRank'})\
        .merge(MNCAATourneySeeds[['Season', 'TeamID','SeedConference','SeedOrder']], **on)\
        .drop(columns='TeamID')\
        .rename(columns={'SeedConference':tag+'SeedConference','SeedOrder':tag+'SeedOrder'})\
        .merge(MRegularSeasonTeamResultsYr[['Season','TeamID','WinRate']], **on)\
        .drop(columns='TeamID')\
        .rename(columns={'WinRate':tag+'SeasonWinRate'})\
        .merge(MRegularSeasonTeamBox.drop(columns='DayNum'), **on)\
        .drop(columns='TeamID')\
        .rename(columns=gen_TeamBoxDict(tag))
    return df

In [None]:
df_ = MNCAATourney.query('Season >= 2003')[['label','Season','DayNum','T1','T2']]
df1 = grab_col('T1',df_)
df2 = grab_col('T2',df_)
df = df1.merge(df2, how='inner',left_on=['label','Season','DayNum','T1','T2'],right_on=['label','Season','DayNum','T1','T2'])
df['OrdinalRankDiff']=df['T2OrdinalRank'] - df['T1OrdinalRank']
df['SeedOrderDiff']=df['T2SeedOrder'] - df['T1SeedOrder']
df.drop(columns=['T1SeedConference','T2SeedConference'], inplace=True)
df.head()

In [None]:
pd.set_option('display.max_rows', df.shape[0]+1)
print(df.isnull().sum().to_frame(name = 'missing')) # print explicitly: only the last expression of a cell is displayed
pd.set_option('display.max_rows', 5)

In [None]:
# separate training and test sets
sc = StandardScaler()
df_training_ = df.loc[df.Season < 2015, df.columns != 'label']
label_training = df.loc[df.Season < 2015,'label']
df_training_.set_index(['Season','DayNum','T1','T2'], inplace=True)
df_training=pd.DataFrame(sc.fit_transform(df_training_),columns = df_training_.columns)
df_training.head()
df_training.describe()

In [None]:
df_test_ = df.loc[df.Season >= 2015, df.columns != 'label']
label_test = df.loc[df.Season >= 2015,'label']
df_test_.set_index(['Season','DayNum','T1','T2'], inplace=True)
df_test = pd.DataFrame(sc.transform (df_test_),columns = df_test_.columns)
df_test.head()
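
Standardization replaces each value x with z = (x - mean) / std, where the mean and standard deviation are estimated on the training set only and then reused for the test set. A minimal sketch on synthetic data (the arrays are randomly generated and unrelated to the tournament features):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices
rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learn mean/std on the training set
test_scaled = scaler.transform(test)        # reuse the training mean/std

print(train_scaled.mean(axis=0).round(6), train_scaled.std(axis=0).round(6))
print(test_scaled.mean(axis=0).round(3))
```

The training columns come out with mean 0 and standard deviation 1. The test columns are only approximately centered, which is expected because their statistics were not used in the fit.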

In [None]:
#check the defeat rate in the training data
label_training.value_counts(normalize=True)

# Predictive Modeling
The following code is inspired by Arthur Tok's notebook for the Titanic competition. It is a stacking ensemble model with two stages: base classification and final prediction. First, we use base classifiers to make predictions. Those predictions then become the new set of features used to train the next classifier.

In [None]:
import sklearn
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import KFold
import xgboost as xgb

* Arthur Tok wrote a class, SklearnHelper, that wraps the scikit-learn classifiers so that the same methods can be called on each of them. Setting seeds ensures the results stay the same each time we run the code.

In [None]:
# Some useful parameters which will come in handy later on
ntrain = df_training.shape[0]
ntest = df_test.shape[0]
SEED = 20200325 # for reproducibility
NFOLDS = 5 # set folds for out-of-fold prediction
kf = KFold(n_splits=NFOLDS, random_state=100, shuffle=True)

# Class to extend the Sklearn classifier
class SklearnHelper(object):
    def __init__(self, clf, seed=0, params=None):
        params = dict(params or {})  # copy so the caller's dict is not mutated; also tolerates params=None
        params['random_state'] = seed
        self.clf = clf(**params)

    def train(self, x_train, y_train):
        self.clf.fit(x_train, y_train)

    def predict(self, x):
        return self.clf.predict(x)
    
    def fit(self,x,y):
        return self.clf.fit(x,y)
    
    def feature_importances(self,x,y):
        print(self.clf.fit(x,y).feature_importances_)

Define a function that performs k-fold cross-validation: on each fold it trains the classifier, predicts the held-out fold, and predicts the test set.

In [None]:
def get_oof(clf, x_train, y_train, x_test):
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))
    
    for i, (train_index, test_index) in enumerate(kf.split(oof_train)):
        x_tr = x_train[train_index]
        y_tr = y_train[train_index]
        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)

        oof_train[test_index] = clf.predict(x_te)
        oof_test_skf[i, :] = clf.predict(x_test)

    oof_test[:] = oof_test_skf.mean(axis=0)
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)
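
To see what the out-of-fold scheme produces, here is a standalone sketch on synthetic data with a single logistic regression. The data set from `make_classification`, the split sizes and the variable names are assumptions made for the illustration, not the tournament data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic binary classification problem split into train and test parts
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof_train = np.zeros(len(X_train))
test_preds = np.zeros((kf.get_n_splits(), len(X_test)))

for i, (fit_idx, hold_idx) in enumerate(kf.split(X_train)):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[fit_idx], y_train[fit_idx])
    # Each training row is predicted by a model that never saw it
    oof_train[hold_idx] = model.predict(X_train[hold_idx])
    # Every fold's model also predicts the full test set
    test_preds[i] = model.predict(X_test)

# Average the five test predictions into one feature
oof_test = test_preds.mean(axis=0)
print(oof_train[:10], oof_test[:10])
```

Because the fold models output class labels, `oof_train` contains only 0s and 1s, while `oof_test` is an average over five folds and can take fractional values such as 0.2 or 0.8.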

Six base classifiers contribute to the first-stage classification.
1. Logistic Regression classifier
2. Support Vector Machine
3. Random Forest classifier
4. Extra Trees classifier
5. AdaBoost classifier
6. Gradient Boosting classifier

In [None]:
# Parameters for the six base classifiers

# Logistic Regression 
lr_params = {'C': 1}

# Support Vector 
svc_params = {
    'kernel' : 'linear',
    'C' : 1
    }

# Random Forest
rf_params = {
    'n_jobs': -1,
    'n_estimators': 500,
    'warm_start': False, 
    'max_depth': 6,
    'min_samples_leaf': 2,
    'max_features' : 'sqrt',
    'verbose': 0
}

# Extra Trees
et_params = {
    'n_jobs': -1,
    'n_estimators':500,
    'max_depth': 8,
    'min_samples_leaf': 2,
    'verbose': 0
}

# AdaBoost
ada_params = {
    'n_estimators': 500,
    'learning_rate' : 0.75
}

# Gradient Boosting
gb_params = {
    'n_estimators': 500,
     #'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

In [None]:
# Create 6 objects
rf = SklearnHelper(clf=RandomForestClassifier, seed=SEED, params=rf_params)
et = SklearnHelper(clf=ExtraTreesClassifier, seed=SEED, params=et_params)
ada = SklearnHelper(clf=AdaBoostClassifier, seed=SEED, params=ada_params)
gb = SklearnHelper(clf=GradientBoostingClassifier, seed=SEED, params=gb_params)
svc = SklearnHelper(clf=SVC, seed=SEED, params=svc_params)
lr = SklearnHelper(clf=LogisticRegression, seed=SEED, params=lr_params)

In [None]:
# Create NumPy arrays of the train, test and target (label) data to feed into our models
y_train = label_training.to_numpy()
x_train = df_training.values # Creates an array of the train data
x_test = df_test.values # Creates an array of the test data

In [None]:
# Create our OOF train and test predictions. These base results will be used as new features
lr_oof_train, lr_oof_test = get_oof(lr,x_train, y_train, x_test) # Logistic Regression Classifier
svc_oof_train, svc_oof_test = get_oof(svc,x_train, y_train, x_test) # Support Vector Classifier
rf_oof_train, rf_oof_test = get_oof(rf,x_train, y_train, x_test) # Random Forest
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test = get_oof(gb,x_train, y_train, x_test) # Gradient Boost

print("Training is complete")

In [None]:
base_predictions_train = pd.DataFrame( {
    'LR': lr_oof_train.ravel(),
    'SVM': svc_oof_train.ravel(),
    'RandomForest': rf_oof_train.ravel(),
    'ExtraTrees': et_oof_train.ravel(),
    'AdaBoost': ada_oof_train.ravel(),
    'GradientBoost': gb_oof_train.ravel()
    })
base_predictions_train.head()

We fit an XGBClassifier on the first-level out-of-fold predictions and the training labels, then use the learned model to predict the test set.

In [None]:
#form 2nd stage training/test sets
x_train = np.concatenate(( et_oof_train, rf_oof_train, ada_oof_train, gb_oof_train, svc_oof_train, lr_oof_train), axis=1)
x_test = np.concatenate(( et_oof_test, rf_oof_test, ada_oof_test, gb_oof_test, svc_oof_test, lr_oof_test), axis=1)

#conduct 2nd level learning model via XGBoost
gbm = xgb.XGBClassifier(
    #learning_rate = 0.02,
     n_estimators= 2000,
     max_depth= 4,
     min_child_weight= 2,
     #gamma=1,
     gamma=0.9,                        
     subsample=0.8,
     colsample_bytree=0.8,
     objective= 'binary:logistic',
     nthread= -1,
     scale_pos_weight=1).fit(x_train, y_train)
predictions = gbm.predict(x_test)


Finally, the accuracy of the stacking ensemble model is around 74%.

In [None]:
gbm.score(x_test, label_test)
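
For comparison, scikit-learn ships a `StackingClassifier` that performs the same two-stage idea (cross-validated base predictions fed to a final estimator) in one object. The sketch below is not the method used in this notebook: it runs on synthetic data, uses only three base models, and swaps in logistic regression as the final estimator instead of XGBoost to keep the example free of extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data standing in for the standardized tournament features
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

stack = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('svc', SVC(kernel='linear', C=1)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # second-stage model (assumption: not XGBoost)
    cv=5,                                  # folds used to build the out-of-fold features
)
stack.fit(X_tr, y_tr)
accuracy = stack.score(X_te, y_te)
print(accuracy)
```

One difference from the hand-written version is that the base models are refit on the full training set for test-time prediction rather than averaging the per-fold models.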