# Esmaeil Rezaei

# March Madness 2024

***Here is the steps to tackle this dataset***:

- `WinTeam` is considered as `team A` and `LTeam` as `team B`
- The goal is to predict if `A` beats `B`
- Value 1 is assigned to the `y` for actual winners in the dataset
- The related values of the winning team are switched with the losing teams and set to zero for `y`
- The data related to years from 1985 to 2023 are considered as training and 2024 as testing
- Logistic Regression is used first for feature selection and then for classification
- Teams in different regions are randomly seeded. Then, losers are removed and winners are seeded for the next games until slot `R6CH`.

***The features defined in this problem are as follows***:

1. **WTeamID**: Considered as `team A`, where `y = 1` if it wins
2. **LTeamID**: Considered as `team B`, where `y = 0` if A wins and vice versa
3. **WNumWins**: Number of wins of the `last year` for `team A`
4. **WNumLosses**: Number of losses of the `last year` for `team A`
5. **WTeam_grade**: Last year relative cumulative sum of the difference between `WScore` and `LScore` for `team A`. Indeed, the score difference is computed first and then the relative cumulative sum is calculated. It is considered relative because some teams may be newer.
6. **WstdWins**: Last year standard deviation (`std`) of the number of wins over the years for `team A`. It is measured to differentiate teams that consistently perform well versus those with more erratic performance patterns over time.
7. **WstdLosses**: Last year `std` of the number of losses over the years for `team A`.
8. **LNumWins**: Number of wins of the last year for `team B`
9. **LNumLosses**: Number of losses of the last year for `team B`
10. **LTeam_grade**: Last year relative cumulative sum of the difference between `WScore` and `LScore` for `team B`.
11. **LstdWins**: Last year `std` of the number of wins over the years for `team B`.
12. **LstdLosses**: Last year `std` of the number of losses over the years for `team B`.

- ### Note: For more brackets, adjust the `NUMBER_OF_BRACKETS` variable to your preferred value.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/march-machine-learning-mania-2024/Conferences.csv
/kaggle/input/march-machine-learning-mania-2024/sample_submission.csv
/kaggle/input/march-machine-learning-mania-2024/WNCAATourneyDetailedResults.csv
/kaggle/input/march-machine-learning-mania-2024/WRegularSeasonCompactResults.csv
/kaggle/input/march-machine-learning-mania-2024/MNCAATourneySeedRoundSlots.csv
/kaggle/input/march-machine-learning-mania-2024/MRegularSeasonDetailedResults.csv
/kaggle/input/march-machine-learning-mania-2024/MNCAATourneyCompactResults.csv
/kaggle/input/march-machine-learning-mania-2024/MGameCities.csv
/kaggle/input/march-machine-learning-mania-2024/WGameCities.csv
/kaggle/input/march-machine-learning-mania-2024/MSeasons.csv
/kaggle/input/march-machine-learning-mania-2024/WNCAATourneySlots.csv
/kaggle/input/march-machine-learning-mania-2024/MSecondaryTourneyTeams.csv
/kaggle/input/march-machine-learning-mania-2024/2024_tourney_seeds.csv
/kaggle/input/march-machine-learning-mania-2024/Cities.csv
/

In [2]:
path = "/kaggle/input/march-machine-learning-mania-2024/"

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
import seaborn as sns
import plotly.express as px
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve
import warnings

warnings.filterwarnings('ignore')
%matplotlib inline

# Functions

In [4]:
# This function finds related data by considering data where at least one of the teams in WTeamID or LTeamId are in season 2024.
def data_extraction(df_seeds, df_season):
    Year = 2024
    TeamID_list = list(df_seeds.query('Season == {}'.format(Year))['TeamID'].unique())
    Season_list = list(df_seeds['Season'].unique())
    Season_list = Season_list[: Season_list.index(Year) + 1]

    data = df_season[df_season['WTeamID'].isin(TeamID_list) | df_season['LTeamID'].isin(TeamID_list)].reset_index(drop = True).copy()
    data = data[data['Season'].isin(Season_list)].reset_index(drop = True)

    return data, TeamID_list

In [5]:
# group the data to create new features
def Feature_Engineering(df):
    df.loc[:, 'WScore_diff'] = df['WScore'] - df['LScore']
    df.loc[:, 'LScore_diff'] = df['LScore'] - df['WScore']

    list_WTeamID = df['WTeamID'].unique()
    list_LTeamID = df['LTeamID'].unique()
    list_all_teams_id = list(set(list_WTeamID) | set(list_LTeamID))


    unique_seasons = df['Season'].unique()
    grouped_WLs = pd.DataFrame({
        'TeamID': list_all_teams_id * len(unique_seasons),
        'Season': [year for year in unique_seasons for _ in range(len(list_all_teams_id))]
    }).sort_values(by=['Season', 'TeamID'])


    grouped_WLs[['WTeamID','NumWins','LTeamID','NumLosses','num_Wdiff','num_Ldiff']] = 0

    num_wins = df.groupby(['Season', 'WTeamID']).size().reset_index(name="NumWins").sort_values(by=['Season', 'WTeamID'])
    num_losses = df.groupby(['Season', 'LTeamID']).size().reset_index(name="NumLosses").sort_values(by=['Season', 'LTeamID'])
    std_wins = df.groupby(['WTeamID', 'Season']).std().reset_index().rename(columns={"WScore_diff": "stdWins"}).sort_values(by=['Season', 'WTeamID'])
    std_losses = df.groupby(['LTeamID', 'Season']).std().reset_index().rename(columns={"LScore_diff": "stdLosses"}).sort_values(by=['Season', 'LTeamID'])
    num_Wdiff = df.groupby(['Season', 'WTeamID'])['WScore_diff'].sum().reset_index(name="num_Wdiff").sort_values(by=['Season', 'WTeamID'])
    num_Ldiff = df.groupby(['Season', 'LTeamID'])['LScore_diff'].sum().reset_index(name="num_Ldiff").sort_values(by=['Season', 'LTeamID'])

    for season in num_wins.Season.unique():

        season_filter = grouped_WLs['Season'] == season

        num_wins_season = num_wins[num_wins['Season'] == season]
        num_losses_season = num_losses[num_losses['Season'] == season]
        std_wins_season = std_wins[std_wins['Season'] == season]
        std_losses_season = std_losses[std_losses['Season'] == season]
        num_Wdiff_season = num_Wdiff[num_Wdiff['Season'] == season]
        num_Ldiff_season = num_Ldiff[num_Ldiff['Season'] == season]

        num_wins_idx = num_wins_season['WTeamID'].tolist()
        num_losses_idx = num_losses_season['LTeamID'].tolist()
        std_wins_idx = std_wins_season['WTeamID'].tolist()
        std_losses_idx = std_losses_season['LTeamID'].tolist()
        num_Wdiff_idx = num_Wdiff_season['WTeamID'].tolist()
        num_Ldiff_idx = num_Ldiff_season['LTeamID'].tolist()

        grouped_WLs.loc[season_filter & grouped_WLs['TeamID'].isin(num_wins_idx), ['WTeamID', 'NumWins']] = num_wins_season[['WTeamID', 'NumWins']].values
        grouped_WLs.loc[season_filter & grouped_WLs['TeamID'].isin(num_losses_idx), ['LTeamID', 'NumLosses']] = num_losses_season[['LTeamID', 'NumLosses']].values
        grouped_WLs.loc[season_filter & grouped_WLs['TeamID'].isin(std_wins_idx), ['WTeamID', 'stdWins']] = std_wins_season[['WTeamID', 'stdWins']].values
        grouped_WLs.loc[season_filter & grouped_WLs['TeamID'].isin(std_losses_idx), ['LTeamID', 'stdLosses']] = std_losses_season[['LTeamID', 'stdLosses']].values
        grouped_WLs.loc[season_filter & grouped_WLs['TeamID'].isin(num_Wdiff_idx), ['WTeamID', 'num_Wdiff']] = num_Wdiff_season[['WTeamID', 'num_Wdiff']].values
        grouped_WLs.loc[season_filter & grouped_WLs['TeamID'].isin(num_Ldiff_idx), ['LTeamID', 'num_Ldiff']] = num_Ldiff_season[['LTeamID', 'num_Ldiff']].values

    grouped_WLs['Score_diff'] = grouped_WLs['num_Wdiff'] + grouped_WLs['num_Ldiff']
    grouped_WLs = grouped_WLs.drop(columns=['num_Wdiff', 'num_Ldiff'], axis=1)
    df = df.drop(columns = ['WScore', 'LScore', 'WScore_diff', 'LScore_diff'], axis = 1)
    return df, grouped_WLs


In [6]:
# Another important feature is computing team grade using relative cumulative sum
def Team_grade(grouped_df):
    TeamID_list = grouped_df['TeamID'].unique()
    grouped_df['Team_grade'] = 0.0
    for team in TeamID_list:
        df_team_idx = grouped_df['TeamID'] == team
        grouped_df.loc[df_team_idx, 'Team_grade'] = round(grouped_df.loc[df_team_idx, 'Score_diff'].expanding().mean(), 3)

    return grouped_df.drop(columns='Score_diff', axis = 1)

In [7]:
# Add the newly created features to the dataset
def Feature_Adder(df, grouped_df, TeamID_list):
    df_temp = pd.DataFrame(data = np.zeros((df.shape[0], 10)), columns = ['WNumWins', 'WNumLosses', 'WTeam_grade', 'WstdWins', 'WstdLosses', 'LNumWins', 'LNumLosses', 'LstdWins', 'LstdLosses', 'LTeam_grade'])

    Season_list = df.Season.unique()

    for season in Season_list:
        current_year = season
        if current_year == 1985:
            last_year = 1985
        else:
            last_year = season - 1

        for team in TeamID_list:
            team_idx_in_group = (grouped_df['Season'] == last_year) & (grouped_df['TeamID'] == team)
            df_Widx = (df['Season'] == current_year) & (df['WTeamID'] == team)
            df_Lidx = (df['Season'] == current_year) & (df['LTeamID'] == team)

            if (sum(df_Widx) > 0) & (sum(team_idx_in_group)) > 0:
                df_temp.loc[df_Widx, 'WNumWins'] = grouped_df.loc[team_idx_in_group, 'NumWins'].values
                df_temp.loc[df_Widx, 'WNumLosses'] = grouped_df.loc[team_idx_in_group, 'NumLosses'].values
                df_temp.loc[df_Widx, 'WstdWins'] = grouped_df.loc[team_idx_in_group, 'stdWins'].values
                df_temp.loc[df_Widx, 'WstdLosses'] = grouped_df.loc[team_idx_in_group, 'stdLosses'].values
                df_temp.loc[df_Widx, 'WTeam_grade'] = grouped_df.loc[team_idx_in_group, 'Team_grade'].values

            if (sum(df_Lidx) > 0) & (sum(team_idx_in_group)) > 0:
                df_temp.loc[df_Lidx, 'LNumWins'] = grouped_df.loc[team_idx_in_group, 'NumWins'].values
                df_temp.loc[df_Lidx, 'LNumLosses'] = grouped_df.loc[team_idx_in_group, 'NumLosses'].values
                df_temp.loc[df_Lidx, 'LstdWins'] = grouped_df.loc[team_idx_in_group, 'stdWins'].values
                df_temp.loc[df_Lidx, 'LstdLosses'] = grouped_df.loc[team_idx_in_group, 'stdLosses'].values
                df_temp.loc[df_Lidx, 'LTeam_grade'] = grouped_df.loc[team_idx_in_group, 'Team_grade'].values

    return pd.concat([df, df_temp], axis = 1)

In [8]:
# Preparing data for training by creating a binary variable for y. We assign 1 for actual winners, then change the places of winners
# and losers and set the value 0 to y.
def Proliferating_Data(df):
    # Asign 1 for actual winers
    df_actual = df.copy()
    df_actual['WTeamID_result'] = 1

    # Assign 0 for winners who are not actual winners, as we are simply putting them in the places of winners
    df_changed = df.copy()
    df_changed['WTeamID'], df_changed['WNumWins'], df_changed['WNumLosses'], df_changed['WstdWins'], df_changed['WstdLosses'], df_changed['WTeam_grade'] = df_changed['LTeamID'], df_changed['LNumWins'], df_changed['LstdWins'], df_changed['LstdLosses'], df_changed['LNumLosses'], df_changed['LTeam_grade']
    df_changed['LTeamID'], df_changed['LNumWins'], df_changed['LstdWins'], df_changed['LstdLosses'], df_changed['LNumLosses'], df_changed['LTeam_grade'] = df['WTeamID'], df['WNumWins'], df['WNumLosses'], df['WstdWins'], df['WstdLosses'], df['WTeam_grade']
    df_changed['WTeamID_result'] = 0

    df = pd.concat([df_actual, df_changed], ignore_index = True)
    df = df.sample(frac=1).reset_index(drop=True)

    return df

In [9]:
# Removes features with p-value > 0
def BackwardElemination(dataframe, dependent_variable):
    num_cols = list(dataframe.columns)
    if len(num_cols) == 0:
        print("No column with pvalue < 0.05")
    while len(num_cols) > 0:
        model = sm.Logit(dependent_variable, dataframe)
        result_logit = model.fit(disp = 0)
        largest_pvalue = round(result_logit.pvalues, 3).nlargest(1)
        if largest_pvalue[0] < 0.05:
            return result_logit, dataframe
            break
        else:
            dataframe = dataframe.drop(columns = largest_pvalue.index, axis = 0)

In [10]:
# Find the play-in games to prepare them for competition and make predictions for the winning team in place A.
def play_in_teams(Seeds):
    seeds_2024 = Seeds.query('Season == 2024').reset_index(drop = True)

    play_in_seeds = []
    play_in_TeamIDs = []
    for idx, seed in enumerate(seeds_2024.Seed):
        if len(seed) > 3:
            play_in_seeds.append(seed)
            play_in_TeamIDs.append(seeds_2024.loc[idx, 'TeamID'])

    team_A = []
    team_B = []

    for idx, ID in enumerate(play_in_TeamIDs):
        if idx % 2 == 0:
            team_A.append(play_in_TeamIDs[idx])
        else:
            team_B.append(play_in_TeamIDs[idx])


    return team_A, team_B, play_in_seeds, play_in_TeamIDs

In [11]:
# Preparation before adding the newly created features to the play-in data
def play_in_data_generator(Team_A, Team_B, play_in_TeamID_ls):
    df_0 = pd.DataFrame({'Season': np.full(len(Team_A), 2024)})
    data = pd.DataFrame({'WTeamID': Team_A, 'LTeamID': Team_B})

    return pd.concat([df_0, data], axis = 1)

In [12]:
# Removing loser teams from the play-in games
def Removing_Play_in_Lossers(seed_df):
    for idx, res in enumerate(y_pred_play_in):
        if res == 0:
            seed_df = seed_df.drop(seed_df.query('TeamID == {}'.format(Team_A[idx])).index).reset_index(drop=True)
            seed_df.loc[seed_df.query('TeamID == {}'.format(Team_B[idx])).index, 'Seed'] = seed_df.loc[seed_df.query('TeamID == {}'.format(Team_B[idx])).index, 'Seed'].str[:3]
        else:
            seed_df = seed_df.drop(seed_df.query('TeamID == {}'.format(Team_B[idx])).index).reset_index(drop=True)
            seed_df.loc[seed_df.query('TeamID == {}'.format(Team_A[idx])).index, 'Seed'] = seed_df.loc[seed_df.query('TeamID == {}'.format(Team_A[idx])).index, 'Seed'].str[:3]
    return seed_df

In [13]:
# Seeding teams after each competition
def seeding(teams_seeds_df):
    data = pd.DataFrame(columns=['Seed_Team_A', 'Seed_Team_B', 'WTeamID', 'LTeamID'])
    M, _ = teams_seeds_df.shape

    seed_W = teams_seeds_df[teams_seeds_df['Seed'].str[0] == 'W']
    seed_X = teams_seeds_df[teams_seeds_df['Seed'].str[0] == 'X']
    seed_Y = teams_seeds_df[teams_seeds_df['Seed'].str[0] == 'Y']
    seed_Z = teams_seeds_df[teams_seeds_df['Seed'].str[0] == 'Z']

    if len(seed_W) > 1:
        # Shuffle seeds
        rand_idx = np.random.choice(len(seed_W), size=len(seed_W), replace=False)
        seed_W = seed_W.iloc[rand_idx, :].reset_index(drop = True)
        Team_A_W = seed_W.iloc[:len(seed_W) // 2, :].reset_index(drop = True)
        Team_B_W = seed_W.iloc[len(seed_W) // 2:, :].reset_index(drop = True)
        data_W = pd.concat([Team_A_W['TeamID'], Team_B_W['TeamID']], axis = 1)
        data_W = pd.DataFrame(data = data_W.values , columns = ['WTeamID', 'LTeamID'])
        data_W['Seed_Team_A'] = Team_A_W.Seed
        data_W['Seed_Team_B'] = Team_B_W.Seed
        data = pd.concat([data, data_W], ignore_index=True)

        rand_idx = np.random.choice(len(seed_X), size=len(seed_X), replace=False)
        seed_X = seed_X.iloc[rand_idx, :].reset_index(drop = True)
        Team_A_X = seed_X.iloc[:len(seed_X) // 2, :].reset_index(drop = True)
        Team_B_X = seed_X.iloc[len(seed_X) // 2:, :].reset_index(drop = True)
        data_X = pd.concat([Team_A_X['TeamID'], Team_B_X['TeamID']], axis = 1)
        data_X = pd.DataFrame(data = data_X.values , columns = ['WTeamID', 'LTeamID'])
        data_X['Seed_Team_A'] = Team_A_X.Seed
        data_X['Seed_Team_B'] = Team_B_X.Seed
        data = pd.concat([data, data_X], ignore_index=True)

        rand_idx = np.random.choice(len(seed_Y), size=len(seed_Y), replace=False)
        seed_Y = seed_Y.iloc[rand_idx, :].reset_index(drop = True)
        Team_A_Y = seed_Y.iloc[:len(seed_Y) // 2, :].reset_index(drop = True)
        Team_B_Y = seed_Y.iloc[len(seed_Y) // 2:, :].reset_index(drop = True)
        data_Y = pd.concat([Team_A_Y['TeamID'], Team_B_Y['TeamID']], axis = 1)
        data_Y = pd.DataFrame(data = data_Y.values , columns = ['WTeamID', 'LTeamID'])
        data_Y['Seed_Team_A'] = Team_A_Y.Seed
        data_Y['Seed_Team_B'] = Team_B_Y.Seed
        data = pd.concat([data, data_Y], ignore_index=True)

        rand_idx = np.random.choice(len(seed_Z), size=len(seed_Z), replace=False)
        seed_Z = seed_Z.iloc[rand_idx, :].reset_index(drop = True)
        Team_A_Z = seed_Z.iloc[:len(seed_Z) // 2, :].reset_index(drop = True)
        Team_B_Z = seed_Z.iloc[len(seed_Z) // 2:, :].reset_index(drop = True)
        data_Z = pd.concat([Team_A_Z['TeamID'], Team_B_Z['TeamID']], axis = 1)
        data_Z = pd.DataFrame(data = data_Z.values , columns = ['WTeamID', 'LTeamID'])
        data_Z['Seed_Team_A'] = Team_A_Z.Seed
        data_Z['Seed_Team_B'] = Team_B_Z.Seed
        data = pd.concat([data, data_Z], ignore_index=True)

    elif len(teams_seeds_df) == 4:
        data = pd.DataFrame({'Seed_Team_A':[seed_W.Seed.values[0], seed_Y.Seed.values[0]],
                             'Seed_Team_B':[seed_X.Seed.values[0], seed_Z.Seed.values[0]],
                             'WTeamID':[seed_W.TeamID.values[0], seed_Y.TeamID.values[0]],
                             'LTeamID':[seed_X.TeamID.values[0], seed_Z.TeamID.values[0]]})
        rand_idx = np.random.choice(2, size=2, replace=False)
        data = data.iloc[rand_idx, :].reset_index(drop = True)
    else:
        seed_WX = pd.concat([seed_W, seed_X])
        seed_YZ = pd.concat([seed_Y, seed_Z])

        data = pd.DataFrame({'Seed_Team_A':seed_WX.Seed.values[0],
                             'Seed_Team_B':seed_YZ.Seed.values[0],
                             'WTeamID':seed_WX.TeamID.values[0],
                             'LTeamID':seed_YZ.TeamID.values[0]}, index=[0])

    # Adding the season column to the DataFrame
    data.insert(0, 'Season', 2024)  # It's inplace

    return data

In [14]:
# Discarding the loser teams and preparing data for seeding
def SLOT_df(seeded_result):
    M, N = seeded_result.shape
    df_temp = pd.DataFrame(data = np.zeros((M, 3)), columns = ['Season', 'Seed', 'TeamID'])
    df_temp['Season'] = 2024

    A_win = seeded_result['Result'] == 1
    B_win = seeded_result['Result'] == 0

    df_temp.loc[A_win, ['Seed', 'TeamID']] = seeded_result.loc[A_win, ['Seed_Team_A', 'WTeamID']].values
    df_temp.loc[B_win, ['Seed', 'TeamID']] = seeded_result.loc[B_win, ['Seed_Team_B', 'LTeamID']].values

    df_temp.TeamID = df_temp.TeamID.astype(int)
    return df_temp

### Read the Men Files

In [15]:
Seeds_Men = pd.read_csv(path + "MNCAATourneySeeds.csv")
Seeds_Men = Seeds_Men.drop_duplicates()

In [16]:
Season_Men = pd.read_csv(path + "MRegularSeasonCompactResults.csv")
Season_Men = Season_Men.iloc[:, :-2]
Season_Men = Season_Men.drop_duplicates()
Season_Men.drop(columns = ['DayNum'], inplace = True) # it did not significantly improve the accuracy
Season_Men = Season_Men

In [17]:
Season_Men.head()

Unnamed: 0,Season,WTeamID,WScore,LTeamID,LScore
0,1985,1228,81,1328,64
1,1985,1106,77,1354,70
2,1985,1112,63,1223,56
3,1985,1165,70,1432,54
4,1985,1192,86,1447,74


In [18]:
Season_Men.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 187289 entries, 0 to 187288
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Season   187289 non-null  int64
 1   WTeamID  187289 non-null  int64
 2   WScore   187289 non-null  int64
 3   LTeamID  187289 non-null  int64
 4   LScore   187289 non-null  int64
dtypes: int64(5)
memory usage: 7.1 MB


In [19]:
Season_Men.isnull().sum()

Season     0
WTeamID    0
WScore     0
LTeamID    0
LScore     0
dtype: int64

# Feature Selection for Men Dataset

### Feature Engineering

In [20]:
df, TeamID_list = data_extraction(Seeds_Men, Season_Men)
df, grouped_df_Men = Feature_Engineering(df)
grouped_df_Men = grouped_df_Men.fillna(0) # for teams with std of wins and losses of zero
grouped_df_Men = Team_grade(grouped_df_Men)
df_Men = Feature_Adder(df, grouped_df_Men, TeamID_list)

### Correlation

In [21]:
matrix_corr = df_Men.drop(columns = ['WTeamID', 'LTeamID'], axis = 1).corr()

fig = px.imshow(matrix_corr,
                labels=dict(x="Columns", y="Columns", color="Correlation Coefficient"),
                x=matrix_corr.columns,
                y=matrix_corr.columns,
                width=800, height=800,
                title="Correlation Matrix")

fig.update_xaxes(side="top")
fig.show()

### Logit for Feature Selection
First, I utilize a logistic regression model using statsmodels to gain insights into the statistical aspects of the model for feature selection. Later, we I transition to using the sklearn library.

In [22]:
df = Proliferating_Data(df_Men)

X = sm.add_constant(df.drop(columns = 'WTeamID_result', axis = 1))
y = df['WTeamID_result']

In [23]:
# Fit a logistic regression model using statsmodels
model = sm.Logit(y, X)
result_logit = model.fit()
result_logit.summary()

Optimization terminated successfully.
         Current function value: 0.274811
         Iterations 8


0,1,2,3
Dep. Variable:,WTeamID_result,No. Observations:,130744.0
Model:,Logit,Df Residuals:,130730.0
Method:,MLE,Df Model:,13.0
Date:,"Sat, 06 Apr 2024",Pseudo R-squ.:,0.6035
Time:,01:21:56,Log-Likelihood:,-35930.0
converged:,True,LL-Null:,-90625.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-3.616e-13,1.700,-2.13e-13,1.000,-3.331,3.331
Season,1.785e-16,0.001,2.12e-13,1.000,-0.002,0.002
WTeamID,0.0006,9.13e-05,6.552,0.000,0.000,0.001
LTeamID,-0.0006,9.13e-05,-6.552,0.000,-0.001,-0.000
WNumWins,-0.0601,0.002,-26.615,0.000,-0.064,-0.056
WNumLosses,0.2977,0.003,95.455,0.000,0.292,0.304
WTeam_grade,-0.0011,0.000,-8.761,0.000,-0.001,-0.001
WstdWins,0.4726,0.005,87.180,0.000,0.462,0.483
WstdLosses,-0.5587,0.005,-120.695,0.000,-0.568,-0.550


### Backward Elemination

In [24]:
result, X_new = BackwardElemination(dataframe = X, dependent_variable = y)
result.summary()

0,1,2,3
Dep. Variable:,WTeamID_result,No. Observations:,130744.0
Model:,Logit,Df Residuals:,130732.0
Method:,MLE,Df Model:,11.0
Date:,"Sat, 06 Apr 2024",Pseudo R-squ.:,0.6035
Time:,01:21:57,Log-Likelihood:,-35930.0
converged:,True,LL-Null:,-90625.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
WTeamID,0.0006,6.54e-05,9.139,0.000,0.000,0.001
LTeamID,-0.0006,6.54e-05,-9.139,0.000,-0.001,-0.000
WNumWins,-0.0601,0.002,-26.706,0.000,-0.064,-0.056
WNumLosses,0.2977,0.003,95.690,0.000,0.292,0.304
WTeam_grade,-0.0011,0.000,-8.778,0.000,-0.001,-0.001
WstdWins,0.4726,0.005,87.260,0.000,0.462,0.483
WstdLosses,-0.5587,0.005,-121.233,0.000,-0.568,-0.550
LNumWins,0.0601,0.002,26.706,0.000,0.056,0.064
LNumLosses,0.5587,0.005,121.233,0.000,0.550,0.568


### Note:
Columns "const" and "Season" have been discarded in the men dataset, and they will not be considered in the next part.

In [25]:
conf = np.exp(result.conf_int())
params = np.exp(result.params)
conf['OR'] = params
pvalue=round(result.pvalues,3)
conf['pvalue'] = pvalue
conf.columns = ['CI 95%(2.5%)', 'CI 95%(97.5%)', 'Odds Ratio','pvalue']
print(conf)

             CI 95%(2.5%)  CI 95%(97.5%)  Odds Ratio  pvalue
WTeamID          1.000470       1.000726    1.000598     0.0
LTeamID          0.999274       0.999530    0.999402     0.0
WNumWins         0.937553       0.945856    0.941695     0.0
WNumLosses       1.338576       1.355001    1.346764     0.0
WTeam_grade      0.998674       0.999158    0.998916     0.0
WstdWins         1.587286       1.621348    1.604226     0.0
WstdLosses       0.566795       0.577128    0.571938     0.0
LNumWins         1.057243       1.066607    1.061914     0.0
LNumLosses       1.732719       1.764306    1.748441     0.0
LstdWins         0.738007       0.747062    0.742521     0.0
LstdLosses       0.616771       0.630006    0.623353     0.0
LTeam_grade      1.000843       1.001327    1.001085     0.0


### Logistic Regression for Men Dataset and Predicting Winers of 2024 

In [26]:
scaler = StandardScaler()

data = X.copy()

num_cols = ['WNumWins','WNumLosses','WstdWins','WstdLosses','WTeam_grade','LNumWins','LNumLosses','LstdWins','LstdLosses','LTeam_grade']
cat_cols = ['WTeamID', 'LTeamID']

data['WTeamID_result'] = y
data = pd.get_dummies(data, columns=cat_cols, drop_first=True)
data[num_cols] = scaler.fit_transform(data[num_cols])

data_test = data.query('Season == 2024').reset_index(drop = True)
data_train = data.drop(data.query('Season == 2024').index, axis = 0).reset_index(drop = True)

y_test = data_test['WTeamID_result'].reset_index(drop = True)
y_train = data_train['WTeamID_result'].reset_index(drop = True)

X_test = data_test.drop(columns = 'WTeamID_result', axis = 1)
X_train = data_train.drop(columns = 'WTeamID_result', axis = 1)

In [27]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [28]:
cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data = cm, columns = ["Predicted: 0","Predicted: 1"], index = ["Actual: 0", "Actual: 1"])

In [29]:
fig = px.imshow(conf_matrix,
                labels=dict(x="Predicted Values", y="Actual Values", color="Values"),
                x=["Negative (0)", "Positive (1)"],
                y=["Negative (0)", "Positive (1)"],
                color_continuous_scale='Blues',  # Adjust color scale if needed
                width=600, height=600,
                text_auto=True,
                title="Confusion Matrix")

fig.update_xaxes(side="top")
fig.show()



In [30]:
TN = cm[0,0]
TP = cm[1,1]
FN = cm[1,0]
FP = cm[0,1]


print("------------------------------  The Statistics Results  ------------------------------\n")
print("Accuracy: {}".format(round((TP+TN)/(TP+TN+FP+FN), 4)))
print("Missclassification: {}".format(round(1 - (TP+TN)/(TP+TN+FP+FN), 4)))
print("Recall/Sensitivity/True Positive Rate (TPR): {}".format(round((TP)/(TP+FP), 4)))
print("Specificity/Discriminant Power/True Negative Rate (TNR): {}".format(round((TN)/(FP+TN), 4)))
print("Positive Predictive Value (PPV)/Precision/precision of the positive class: {}".format(round((TP)/(TP+FP), 4)))
print("Negative Predictive Value (NPV)/precision of the negative class: {}".format(round((TN)/(TN+FN), 4)))
sensitivity = TP/(TP+FN)
specificity = TN/(TN+FP)
print("Positive Likelihood Ratio: {}".format(round(sensitivity/(1 - specificity), 4)))
print("Negative Likelihood Ratio: {}".format(round((1 - sensitivity)/ specificity, 4)))
print("-------------------------------------------------------------------------------------\n")

------------------------------  The Statistics Results  ------------------------------

Accuracy: 0.8843
Missclassification: 0.1157
Recall/Sensitivity/True Positive Rate (TPR): 0.8843
Specificity/Discriminant Power/True Negative Rate (TNR): 0.8843
Positive Predictive Value (PPV)/Precision/precision of the positive class: 0.8843
Negative Predictive Value (NPV)/precision of the negative class: 0.8843
Positive Likelihood Ratio: 7.6455
Negative Likelihood Ratio: 0.1308
-------------------------------------------------------------------------------------



In [31]:
y_pred_prob_yes = model.predict_proba(X_test)[:, 1]
FPR, TPR, thresholds = roc_curve(y_test, y_pred_prob_yes.reshape(-1, 1))
roc_data = pd.DataFrame({
    'FPR': FPR,
    'TPR': TPR
})
fig = px.line(roc_data, x='FPR', y='TPR',
              title='ROC curve for MM 2024',
              labels={'FPR': 'False positive rate (1-Specificity)', 'TPR': 'True positive rate (Sensitivity)'},
              width=500, height=500)
fig.show()



# Simulate Brackets for Teams in 2024 (Men Dataset)

### Identify the winners of the play-in games and eliminate the losers

In [32]:
# Prepare play-in matches to predict which ones will advance
Team_A, Team_B, play_in_seeds, play_in_TeamID_ls = play_in_teams(Seeds_Men)
play_in_data = play_in_data_generator(Team_A, Team_B, play_in_TeamID_ls)
play_in_df = Feature_Adder(play_in_data, grouped_df_Men, play_in_TeamID_ls)

In [33]:
# One-hot encoding and standardization
data = X.drop(columns = ['const', 'Season'], axis = 1).copy()
data = data.drop(X.query('Season == 2024').index).reset_index(drop = True)
play_in_df = play_in_df.drop(columns = 'Season', axis = 1)
data = pd.concat([data, play_in_df], ignore_index = True)
data = pd.get_dummies(data, columns=['WTeamID', 'LTeamID'], drop_first=True)
data[num_cols] = scaler.fit_transform(data[num_cols])
play_in_df = data.iloc[-play_in_df.shape[0]:, :].reset_index(drop = True)
data = data.iloc[:-play_in_df.shape[0], :]

In [34]:
# Refit the model after discarding constants and the season variable
model.fit(data, y_train)

In [35]:
y_pred_play_in = model.predict(play_in_df)
pd.DataFrame({'Team A': Team_A, 'Team B': Team_B, 'Wining Team A?': y_pred_play_in})

Unnamed: 0,Team A,Team B,Wining Team A?
0,1224,1447,1
1,1161,1438,1
2,1212,1286,0
3,1129,1160,1


In [36]:
# Removing losing teams from the play-in games
seed_df = Seeds_Men.query('Season == 2024').reset_index(drop=True)
seeds_cleaned = Removing_Play_in_Lossers(seed_df)
R_seed = seeds_cleaned

In [37]:
# Possible slots
slots = ['R1W1', 'R1W2', 'R1W3', 'R1W4', 'R1W5', 'R1W6', 'R1W7', 'R1W8',
         'R1X1', 'R1X2', 'R1X3', 'R1X4', 'R1X5', 'R1X6', 'R1X7', 'R1X8',
         'R1Y1', 'R1Y2', 'R1Y3', 'R1Y4', 'R1Y5', 'R1Y6', 'R1Y7', 'R1Y8',
         'R1Z1', 'R1Z2', 'R1Z3', 'R1Z4', 'R1Z5', 'R1Z6', 'R1Z7', 'R1Z8',
         'R2W1', 'R2W2', 'R2W3', 'R2W4',
         'R2X1', 'R2X2', 'R2X3', 'R2X4',
         'R2Y1', 'R2Y2', 'R2Y3', 'R2Y4',
         'R2Z1', 'R2Z2', 'R2Z3', 'R2Z4',
         'R3W1', 'R3W2',
         'R3X1', 'R3X2',
         'R3Y1', 'R3Y2',
         'R3Z1', 'R3Z2',
         'R4W1',
         'R4X1',
         'R4Y1',
         'R4Z1',
         'R5WX',
         'R5YZ',
         'R6CH']

### Simulation

In [38]:

'''
    Here is the simulation part and where the NUMBER OF BRACKETS is defined.
    
'''


NUMBER_OF_BRACKETS = 1000
num_brack = NUMBER_OF_BRACKETS
submission_M = pd.DataFrame(data = np.array(np.zeros((63*num_brack, 4))), columns = ['Tournament', 'Bracket', 'Slot', 'Team'])
submission_M['Slot'] = slots * num_brack
submission_M['Tournament'] = 'M'


seeded_team = seeding(seeds_cleaned)
R_data = Feature_Adder(seeded_team.drop(columns = ['Seed_Team_A', 'Seed_Team_B'], axis = 1), grouped_df_Men, list(np.concatenate([seeded_team['WTeamID'].values, seeded_team['LTeamID'].values])))
R_data['WTeamID_result'] = 0

# proliferate
data_1985_2023 = df_Men.drop(df_Men.query('Season == 2024').index)
proliferated_df = Proliferating_Data(data_1985_2023)

# shuffle
M, N = proliferated_df.shape
rand_idx = np.random.choice(M, (1,M), replace = False).squeeze()
proliferated_df = proliferated_df.iloc[rand_idx, :].reset_index(drop = True)

# Scaling and OneHotEncoding
df = pd.concat([proliferated_df, R_data]).reset_index(drop = True)
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# split data
X_test_temp = df.query('Season == 2024')
X_train_temp = df.drop(X_test_temp.index).reset_index(drop = True)

X_test = X_test_temp.drop(columns = ['Season', 'WTeamID_result'], axis = 1).reset_index(drop = True)
X_train = X_train_temp.drop(columns = ['Season', 'WTeamID_result'], axis = 1).reset_index(drop = True)
y_train = X_train_temp['WTeamID_result'].reset_index(drop = True)

# Scaling
scaler.fit(X_train)
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.fit_transform(X_test[num_cols])

model.fit(X_train, y_train)


for bracket in np.arange(1, num_brack+1):
    if (bracket + 1) % 10 == 0:
        print('Simulating bracket {} of {} ...'.format((bracket + 1), num_brack))
    R_seed = seeds_cleaned
    SEEDs = []
    for Round in np.arange(1, 7):
        seeded_team = seeding(R_seed)
        R_data = Feature_Adder(seeded_team.drop(columns = ['Seed_Team_A', 'Seed_Team_B'], axis = 1), grouped_df_Men, list(np.concatenate([seeded_team['WTeamID'].values, seeded_team['LTeamID'].values])))
        R_data['WTeamID_result'] = 0
        #Fitting
        R_data.drop(columns = ['WTeamID_result'], axis = 1, inplace = True)
        X_test_copy = X_train.iloc[:R_data.shape[0], :].reset_index(drop = True)

        cols_to_drop = ['WNumWins', 'WNumLosses', 'WTeam_grade', 'WstdWins', 'WstdLosses',
                        'LNumWins', 'LNumLosses', 'LstdWins', 'LstdLosses', 'LTeam_grade']
        X_test_copy.drop(columns = cols_to_drop, axis = 1, inplace = True)
        
        X_test_copy.iloc[:, :] = False
        X_test = pd.concat([R_data, X_test_copy], axis = 1)

        # Select rows where either WTeamID or LTeamID matches with the encoded columns name and set them to True
        for i in np.arange(len(X_test)):
            X_test.loc[i, ['WTeamID_{}'.format(X_test.WTeamID[i]), 'LTeamID_{}'.format(X_test.LTeamID[i])]] = True

        X_test.drop(columns = ['Season', 'WTeamID', 'LTeamID'], axis = 1, inplace = True)
                        
        R_pred = model.predict(X_test)
        seeded_result = seeded_team.copy()
        seeded_result['Result'] = R_pred.reshape(-1, 1)

        R_seed = SLOT_df(seeded_result)
        SEEDs.extend(list(R_seed.loc[:, 'Seed'].values))
    submission_M.loc[(bracket-1)*63:bracket*63-1, 'Team'] = SEEDs
    submission_M.loc[(bracket-1)*63:bracket*63-1, 'Bracket'] = int(bracket)

submission_M['Bracket'] = submission_M['Bracket'].astype(int)

Simulating bracket 10 of 1000 ...
Simulating bracket 20 of 1000 ...
Simulating bracket 30 of 1000 ...
Simulating bracket 40 of 1000 ...
Simulating bracket 50 of 1000 ...
Simulating bracket 60 of 1000 ...
Simulating bracket 70 of 1000 ...
Simulating bracket 80 of 1000 ...
Simulating bracket 90 of 1000 ...
Simulating bracket 100 of 1000 ...
Simulating bracket 110 of 1000 ...
Simulating bracket 120 of 1000 ...
Simulating bracket 130 of 1000 ...
Simulating bracket 140 of 1000 ...
Simulating bracket 150 of 1000 ...
Simulating bracket 160 of 1000 ...
Simulating bracket 170 of 1000 ...
Simulating bracket 180 of 1000 ...
Simulating bracket 190 of 1000 ...
Simulating bracket 200 of 1000 ...
Simulating bracket 210 of 1000 ...
Simulating bracket 220 of 1000 ...
Simulating bracket 230 of 1000 ...
Simulating bracket 240 of 1000 ...
Simulating bracket 250 of 1000 ...
Simulating bracket 260 of 1000 ...
Simulating bracket 270 of 1000 ...
Simulating bracket 280 of 1000 ...
Simulating bracket 290 of 100

In [39]:
# Merge the datasets on the 'Team' column in submission_W and the 'Seed' column in seeds_cleaned
MTeams = pd.read_csv(path + 'MTeams.csv')

submission_M_merged = submission_M.copy()
submission_M_merged = pd.merge(submission_M_merged, seeds_cleaned, left_on='Team', right_on='Seed', how='left')
submission_M_merged.drop(['Seed', 'Season'], axis=1, inplace=True)
submission_M_merged = pd.merge(submission_M_merged, MTeams, on='TeamID', how='left')

submission_M_merged['TeamInfo'] = submission_M_merged.Team+'('+ submission_M_merged.TeamName+')'
submission_M_merged

Unnamed: 0,Tournament,Bracket,Slot,Team,TeamID,TeamName,FirstD1Season,LastD1Season,TeamInfo
0,M,1,R1W1,W06,1140,BYU,1985,2024,W06(BYU)
1,M,1,R1W2,W08,1194,FL Atlantic,1994,2024,W08(FL Atlantic)
2,M,1,R1W3,W16,1391,Stetson,1985,2024,W16(Stetson)
3,M,1,R1W4,W13,1463,Yale,1985,2024,W13(Yale)
4,M,1,R1W5,W11,1182,Duquesne,1985,2024,W11(Duquesne)
...,...,...,...,...,...,...,...,...,...
62995,M,1000,R4Y1,Y13,1359,Samford,1985,2024,Y13(Samford)
62996,M,1000,R4Z1,Z16,1255,Longwood,2005,2024,Z16(Longwood)
62997,M,1000,R5WX,Y13,1359,Samford,1985,2024,Y13(Samford)
62998,M,1000,R5YZ,X16,1224,Howard,1985,2024,X16(Howard)


### Visualize Number of Wins Per Region for Different Teams Across Various Possible Brackets

In [40]:
submission_M_merged['SLOT'] = [slot[:2] for slot in slots]*num_brack
sub_grouped = submission_M_merged.groupby(['SLOT', 'TeamInfo'])['Team'].size().reset_index(name='numWins')

# Create and display treemap
fig = px.treemap(sub_grouped, path=["SLOT", "TeamInfo"],
                 values='numWins', color='numWins',
                 hover_data=['TeamInfo'], color_continuous_scale="Greens",
                 )

# Update layout and display treemap
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25), height=1000, title="Number of Wins Per Region for Different Teams Across Various Possible Brackets")
fig.show()

In [41]:
# Visualize the frequency of wins for teams across different slots (Men dataset)
fig = px.sunburst(sub_grouped, path=['SLOT', 'TeamInfo'], values='numWins',
                  color='numWins', hover_data=['SLOT'])

fig.update_layout(margin=dict(t=50, l=25, r=25, b=25), height=800, 
                  title="Frequency of Wins for Teams Across Different Slots (Women Dataset)")

fig.show()

In [42]:
# Visualize the frequency of wins for teams across different regions (Men dataset)
submission_M_merged['Region'] = [team[:1] for team in list(submission_M_merged['Team'].values)]
sub_grouped = submission_M_merged.groupby(['Region', 'TeamInfo'])['Team'].size().reset_index(name='numWins')

fig = px.sunburst(sub_grouped, path=['Region', 'TeamInfo'], values='numWins',
                  color='numWins', hover_data=['Region'])

fig.update_layout(margin=dict(t=50, l=25, r=25, b=25), height=800, 
                  title="Frequency of Wins for Teams Across Different Slots (Women Dataset)")

fig.show()

# Feature Selection for Men Dataset

### Read the Women Files

In [43]:
Seeds_Women = pd.read_csv(path + "WNCAATourneySeeds.csv")
Seeds_Women = Seeds_Women.drop_duplicates()

Season_Women = pd.read_csv(path + "WRegularSeasonCompactResults.csv")
Season_Women = Season_Women.iloc[:, :-2]
Season_Women = Season_Women.drop_duplicates()
Season_Women.drop(columns = ['DayNum'], inplace = True) # it did not significantly improve the accuracy

In [44]:
Season_Women.head()

Unnamed: 0,Season,WTeamID,WScore,LTeamID,LScore
0,1998,3104,91,3202,41
1,1998,3163,87,3221,76
2,1998,3222,66,3261,59
3,1998,3307,69,3365,62
4,1998,3349,115,3411,35


In [45]:
Season_Women.tail()

Unnamed: 0,Season,WTeamID,WScore,LTeamID,LScore
131582,2024,3465,75,3372,74
131583,2024,3179,76,3283,75
131584,2024,3180,68,3392,60
131585,2024,3221,61,3131,55
131586,2024,3357,69,3478,48


In [46]:
Season_Women.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 131587 entries, 0 to 131586
Data columns (total 5 columns):
 #   Column   Non-Null Count   Dtype
---  ------   --------------   -----
 0   Season   131587 non-null  int64
 1   WTeamID  131587 non-null  int64
 2   WScore   131587 non-null  int64
 3   LTeamID  131587 non-null  int64
 4   LScore   131587 non-null  int64
dtypes: int64(5)
memory usage: 5.0 MB


In [47]:
Season_Women.isnull().sum()

Season     0
WTeamID    0
WScore     0
LTeamID    0
LScore     0
dtype: int64

### Feature Engineering

In [48]:
# we get a list of team ids in 2024 for prediction: TeamID_list
df, TeamID_list = data_extraction(Seeds_Women, Season_Women)
df, grouped_df_Women = Feature_Engineering(df)
grouped_df_Women = grouped_df_Women.fillna(0) # for teams with std of wins and losses of zero
grouped_df_Women = Team_grade(grouped_df_Women)
df_Women = Feature_Adder(df, grouped_df_Women, TeamID_list)

### Correlation

In [49]:
matrix_corr = df_Women.drop(columns = ['WTeamID', 'LTeamID'], axis = 1).corr()

fig = px.imshow(matrix_corr,
                labels=dict(x="Columns", y="Columns", color="Correlation Coefficient"),
                x=matrix_corr.columns,
                y=matrix_corr.columns,
                width=800, height=800,
                title="Correlation Matrix")

fig.update_xaxes(side="top")
fig.show()

### Logit for Feature Selection

In [50]:
df = Proliferating_Data(df_Women)

X = sm.add_constant(df.drop(columns = 'WTeamID_result', axis = 1))
y = df['WTeamID_result']

# Fit a logistic regression model using statsmodels
model = sm.Logit(y, X)
result_logit = model.fit()
result_logit.summary()

Optimization terminated successfully.
         Current function value: 0.411815
         Iterations 7


0,1,2,3
Dep. Variable:,WTeamID_result,No. Observations:,86650.0
Model:,Logit,Df Residuals:,86636.0
Method:,MLE,Df Model:,13.0
Date:,"Sat, 06 Apr 2024",Pseudo R-squ.:,0.4059
Time:,01:37:06,Log-Likelihood:,-35684.0
converged:,True,LL-Null:,-60061.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
const,-1.941e-13,2.422,-8.01e-14,1.000,-4.747,4.747
Season,9.913e-17,0.001,8.34e-14,1.000,-0.002,0.002
WTeamID,-6.125e-05,9.31e-05,-0.658,0.511,-0.000,0.000
LTeamID,6.125e-05,9.31e-05,0.658,0.511,-0.000,0.000
WNumWins,-0.0116,0.002,-5.279,0.000,-0.016,-0.007
WNumLosses,0.0996,0.002,42.272,0.000,0.095,0.104
WTeam_grade,0.0005,9.46e-05,5.126,0.000,0.000,0.001
WstdWins,0.3037,0.004,70.051,0.000,0.295,0.312
WstdLosses,-0.2936,0.003,-91.152,0.000,-0.300,-0.287


In [51]:
result, X_new = BackwardElemination(dataframe = X, dependent_variable = y)
result.summary()

0,1,2,3
Dep. Variable:,WTeamID_result,No. Observations:,86650.0
Model:,Logit,Df Residuals:,86640.0
Method:,MLE,Df Model:,9.0
Date:,"Sat, 06 Apr 2024",Pseudo R-squ.:,0.4059
Time:,01:37:07,Log-Likelihood:,-35684.0
converged:,True,LL-Null:,-60061.0
Covariance Type:,nonrobust,LLR p-value:,0.0

0,1,2,3,4,5,6
,coef,std err,z,P>|z|,[0.025,0.975]
WNumWins,-0.0116,0.002,-5.398,0.000,-0.016,-0.007
WNumLosses,0.0996,0.002,42.675,0.000,0.095,0.104
WTeam_grade,0.0005,9.41e-05,5.149,0.000,0.000,0.001
WstdWins,0.3036,0.004,70.052,0.000,0.295,0.312
WstdLosses,-0.2936,0.003,-93.115,0.000,-0.300,-0.287
LNumWins,0.0116,0.002,5.398,0.000,0.007,0.016
LNumLosses,0.2936,0.003,93.115,0.000,0.287,0.300
LstdWins,-0.0996,0.002,-42.675,0.000,-0.104,-0.095
LstdLosses,-0.3036,0.004,-70.052,0.000,-0.312,-0.295


In [52]:
conf = np.exp(result.conf_int())
params = np.exp(result.params)
conf['OR'] = params
pvalue=round(result.pvalues,3)
conf['pvalue'] = pvalue
conf.columns = ['CI 95%(2.5%)', 'CI 95%(97.5%)', 'Odds Ratio','pvalue']
print(conf)

             CI 95%(2.5%)  CI 95%(97.5%)  Odds Ratio  pvalue
WNumWins         0.984293       0.992629    0.988452     0.0
WNumLosses       1.099637       1.109739    1.104677     0.0
WTeam_grade      1.000300       1.000669    1.000484     0.0
WstdWins         1.343328       1.366348    1.354789     0.0
WstdLosses       0.740948       0.750165    0.745542     0.0
LNumWins         1.007425       1.015958    1.011683     0.0
LNumLosses       1.333041       1.349622    1.341305     0.0
LstdWins         0.901113       0.909391    0.905242     0.0
LstdLosses       0.731878       0.744420    0.738122     0.0
LTeam_grade      0.999332       0.999700    0.999516     0.0


### Logistic Regression for Men Dataset and Predicting Winers of 2024 

In [53]:
scaler = StandardScaler()
data = X.copy()

num_cols = ['WNumWins','WNumLosses','WstdWins','WstdLosses','WTeam_grade','LNumWins','LNumLosses','LstdWins','LstdLosses','LTeam_grade']
cat_cols = ['WTeamID', 'LTeamID']

data['WTeamID_result'] = y
data = pd.get_dummies(data, columns=cat_cols, drop_first=True)
data[num_cols] = scaler.fit_transform(data[num_cols])

data_test = data.query('Season == 2024').reset_index(drop = True)
data_train = data.drop(data.query('Season == 2024').index, axis = 0).reset_index(drop = True)

y_test = data_test['WTeamID_result'].reset_index(drop = True)
y_train = data_train['WTeamID_result'].reset_index(drop = True)

X_test = data_test.drop(columns = 'WTeamID_result', axis = 1)
X_train = data_train.drop(columns = 'WTeamID_result', axis = 1)

In [54]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [55]:
cm = confusion_matrix(y_test, y_pred)
conf_matrix = pd.DataFrame(data = cm, columns = ["Predicted: 0","Predicted: 1"], index = ["Actual: 0", "Actual: 1"])

In [56]:
fig = px.imshow(conf_matrix,
                labels=dict(x="Predicted Values", y="Actual Values", color="Values"),
                x=["Negative (0)", "Positive (1)"],
                y=["Negative (0)", "Positive (1)"],
                color_continuous_scale='Blues',  # Adjust color scale if needed
                width=600, height=600,
                text_auto=True,
                title="Confusion Matrix")

fig.update_xaxes(side="top")
fig.show()

In [57]:
TN = cm[0,0]
TP = cm[1,1]
FN = cm[1,0]
FP = cm[0,1]


print("------------------------------  The Statistics Results  ------------------------------\n")
print("Accuracy: {}".format(round((TP+TN)/(TP+TN+FP+FN), 4)))
print("Missclassification: {}".format(round(1 - (TP+TN)/(TP+TN+FP+FN), 4)))
print("Recall/Sensitivity/True Positive Rate (TPR): {}".format(round((TP)/(TP+FP), 4)))
print("Specificity/Discriminant Power/True Negative Rate (TNR): {}".format(round((TN)/(FP+TN), 4)))
print("Positive Predictive Value (PPV)/Precision/precision of the positive class: {}".format(round((TP)/(TP+FP), 4)))
print("Negative Predictive Value (NPV)/precision of the negative class: {}".format(round((TN)/(TN+FN), 4)))
sensitivity = TP/(TP+FN)
specificity = TN/(TN+FP)
print("Positive Likelihood Ratio: {}".format(round(sensitivity/(1 - specificity), 4)))
print("Negative Likelihood Ratio: {}".format(round((1 - sensitivity)/ specificity, 4)))
print("-------------------------------------------------------------------------------------\n")

------------------------------  The Statistics Results  ------------------------------

Accuracy: 0.8277
Missclassification: 0.1723
Recall/Sensitivity/True Positive Rate (TPR): 0.8338
Specificity/Discriminant Power/True Negative Rate (TNR): 0.8368
Positive Predictive Value (PPV)/Precision/precision of the positive class: 0.8338
Negative Predictive Value (NPV)/precision of the negative class: 0.8218
Positive Likelihood Ratio: 5.0169
Negative Likelihood Ratio: 0.2168
-------------------------------------------------------------------------------------



In [58]:
y_pred_prob_yes = model.predict_proba(X_test)[:, 1]
FPR, TPR, thresholds = roc_curve(y_test, y_pred_prob_yes.reshape(-1, 1))
roc_data = pd.DataFrame({
    'FPR': FPR,
    'TPR': TPR
})
fig = px.line(roc_data, x='FPR', y='TPR',
              title='ROC curve for MM 2024',
              labels={'FPR': 'False positive rate (1-Specificity)', 'TPR': 'True positive rate (Sensitivity)'},
              width=500, height=500)
fig.show()

# Simulate Brackets for Teams in 2024 (Women Dataset)

### Identify the winners of the play-in games and eliminate the losers

In [59]:
# Prepare the play-in matches to forecast which ones will win
Team_A, Team_B, play_in_seeds, play_in_TeamID_ls = play_in_teams(Seeds_Women)
play_in_data = play_in_data_generator(Team_A, Team_B, play_in_TeamID_ls)
play_in_df = Feature_Adder(play_in_data, grouped_df_Women, play_in_TeamID_ls)

In [60]:
# One-hot encoding and standardization
data = X.drop(columns = ['const', 'Season'], axis = 1).copy()
data = data.drop(X.query('Season == 2024').index).reset_index(drop = True)
play_in_df = play_in_df.drop(columns = 'Season', axis = 1)
data = pd.concat([data, play_in_df], ignore_index = True)
data = pd.get_dummies(data, columns=['WTeamID', 'LTeamID'], drop_first=True)
data[num_cols] = scaler.fit_transform(data[num_cols])
play_in_df = data.iloc[-play_in_df.shape[0]:, :].reset_index(drop = True)
data = data.iloc[:-play_in_df.shape[0], :]

In [61]:
# Re-fit after discarding constants and the 'Season' variable
model.fit(data, y_train)

In [62]:
y_pred_play_in = model.predict(play_in_df)
pd.DataFrame({'Team A': Team_A, 'Team B': Team_B, 'Wining Team A?': y_pred_play_in})

Unnamed: 0,Team A,Team B,Wining Team A?
0,3342,3357,0
1,3221,3404,1
2,3112,3120,1
3,3162,3435,1


In [63]:
# Removing losers from the play-in games
seed_df = Seeds_Women.query('Season == 2024').reset_index(drop=True)
seeds_cleaned = Removing_Play_in_Lossers(seed_df)
R_seed = seeds_cleaned

### Simulation

In [64]:

'''
    Here is the simulation part and where the NUMBER OF BRACKETS is defined.
    
'''
num_brack = NUMBER_OF_BRACKETS
submission_W = pd.DataFrame(data = np.array(np.zeros((63*num_brack, 4))), columns = ['Tournament', 'Bracket', 'Slot', 'Team'])
submission_W['Slot'] = slots * num_brack
submission_W['Tournament'] = 'W'


seeded_team = seeding(seeds_cleaned)
R_data = Feature_Adder(seeded_team.drop(columns = ['Seed_Team_A', 'Seed_Team_B'], axis = 1), grouped_df_Women, list(np.concatenate([seeded_team['WTeamID'].values, seeded_team['LTeamID'].values])))
R_data['WTeamID_result'] = 0

# proliferate
data_1985_2023 = df_Women.drop(df_Women.query('Season == 2024').index)
proliferated_df = Proliferating_Data(data_1985_2023)

# shuffle
M, N = proliferated_df.shape
rand_idx = np.random.choice(M, (1,M), replace = False).squeeze()
proliferated_df = proliferated_df.iloc[rand_idx, :].reset_index(drop = True)

# Scaling and OneHotEncoding
df = pd.concat([proliferated_df, R_data]).reset_index(drop = True)
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)

# split data
X_test_temp = df.query('Season == 2024')
X_train_temp = df.drop(X_test_temp.index).reset_index(drop = True)

X_test = X_test_temp.drop(columns = ['Season', 'WTeamID_result'], axis = 1).reset_index(drop = True)
X_train = X_train_temp.drop(columns = ['Season', 'WTeamID_result'], axis = 1).reset_index(drop = True)
y_train = X_train_temp['WTeamID_result'].reset_index(drop = True)

# Scaling
scaler.fit(X_train)
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.fit_transform(X_test[num_cols])

model.fit(X_train, y_train)


for bracket in np.arange(1, num_brack+1):
    if (bracket + 1) % 10 == 0:
        print('Simulating bracket {} of {} ...'.format((bracket + 1), num_brack))
    R_seed = seeds_cleaned
    SEEDs = []
    for Round in np.arange(1, 7):
        seeded_team = seeding(R_seed)
        R_data = Feature_Adder(seeded_team.drop(columns = ['Seed_Team_A', 'Seed_Team_B'], axis = 1), grouped_df_Women, list(np.concatenate([seeded_team['WTeamID'].values, seeded_team['LTeamID'].values])))
        R_data['WTeamID_result'] = 0
        #Fitting
        R_data.drop(columns = ['WTeamID_result'], axis = 1, inplace = True)
        X_test_copy = X_train.iloc[:R_data.shape[0], :].reset_index(drop = True)

        cols_to_drop = ['WNumWins', 'WNumLosses', 'WTeam_grade', 'WstdWins', 'WstdLosses',
                        'LNumWins', 'LNumLosses', 'LstdWins', 'LstdLosses', 'LTeam_grade']
        X_test_copy.drop(columns = cols_to_drop, axis = 1, inplace = True)
        
        X_test_copy.iloc[:, :] = False
        X_test = pd.concat([R_data, X_test_copy], axis = 1)

        # Select rows where either WTeamID or LTeamID matches with the encoded columns name and set them to True
        for i in np.arange(len(X_test)):
            X_test.loc[i, ['WTeamID_{}'.format(X_test.WTeamID[i]), 'LTeamID_{}'.format(X_test.LTeamID[i])]] = True

        X_test.drop(columns = ['Season', 'WTeamID', 'LTeamID'], axis = 1, inplace = True)
                        
        R_pred = model.predict(X_test)
        seeded_result = seeded_team.copy()
        seeded_result['Result'] = R_pred.reshape(-1, 1)

        R_seed = SLOT_df(seeded_result)
        SEEDs.extend(list(R_seed.loc[:, 'Seed'].values))
    submission_W.loc[(bracket-1)*63:bracket*63-1, 'Team'] = SEEDs
    submission_W.loc[(bracket-1)*63:bracket*63-1, 'Bracket'] = int(bracket)

submission_W['Bracket'] = submission_W['Bracket'].astype(int)

Simulating bracket 10 of 1000 ...
Simulating bracket 20 of 1000 ...
Simulating bracket 30 of 1000 ...
Simulating bracket 40 of 1000 ...
Simulating bracket 50 of 1000 ...
Simulating bracket 60 of 1000 ...
Simulating bracket 70 of 1000 ...
Simulating bracket 80 of 1000 ...
Simulating bracket 90 of 1000 ...
Simulating bracket 100 of 1000 ...
Simulating bracket 110 of 1000 ...
Simulating bracket 120 of 1000 ...
Simulating bracket 130 of 1000 ...
Simulating bracket 140 of 1000 ...
Simulating bracket 150 of 1000 ...
Simulating bracket 160 of 1000 ...
Simulating bracket 170 of 1000 ...
Simulating bracket 180 of 1000 ...
Simulating bracket 190 of 1000 ...
Simulating bracket 200 of 1000 ...
Simulating bracket 210 of 1000 ...
Simulating bracket 220 of 1000 ...
Simulating bracket 230 of 1000 ...
Simulating bracket 240 of 1000 ...
Simulating bracket 250 of 1000 ...
Simulating bracket 260 of 1000 ...
Simulating bracket 270 of 1000 ...
Simulating bracket 280 of 1000 ...
Simulating bracket 290 of 100

In [65]:
WTeams = pd.read_csv(path + 'WTeams.csv')

# Merge the datasets on the 'Team' column in submission_W and the 'Seed' column in seeds_cleaned
submission_W_merged = submission_W.copy()
submission_W_merged = pd.merge(submission_W_merged, seeds_cleaned, left_on='Team', right_on='Seed', how='left')
submission_W_merged.drop(['Seed', 'Season'], axis=1, inplace=True)
submission_W_merged = pd.merge(submission_W_merged, WTeams, on='TeamID', how='left')

submission_W_merged['TeamInfo'] = submission_W_merged.Team+'('+ submission_W_merged.TeamName+')'
submission_W_merged

Unnamed: 0,Tournament,Bracket,Slot,Team,TeamID,TeamName,TeamInfo
0,W,1,R1W1,W03,3333,Oregon St,W03(Oregon St)
1,W,1,R1W2,W14,3186,E Washington,W14(E Washington)
2,W,1,R1W3,W04,3231,Indiana,W04(Indiana)
3,W,1,R1W4,W10,3266,Marquette,W10(Marquette)
4,W,1,R1W5,W09,3277,Michigan St,W09(Michigan St)
...,...,...,...,...,...,...,...
62995,W,1000,R4Y1,Y10,3424,UNLV,Y10(UNLV)
62996,W,1000,R4Z1,Z12,3162,Columbia,Z12(Columbia)
62997,W,1000,R5WX,Z12,3162,Columbia,Z12(Columbia)
62998,W,1000,R5YZ,W03,3333,Oregon St,W03(Oregon St)


### Visualize Number of Wins Per Region for Different Teams Across Various Possible Brackets

In [66]:
submission_W_merged['SLOT'] = [slot[:2] for slot in slots]*num_brack
sub_grouped = submission_W_merged.groupby(['SLOT', 'TeamInfo'])['Team'].size().reset_index(name='numWins')

# Create and display treemap
fig = px.treemap(sub_grouped, path=["SLOT", "TeamInfo"],
                 values='numWins', color='numWins',
                 hover_data=['TeamInfo'], color_continuous_scale="Greens",
                 )

# Update layout and display treemap
fig.update_layout(margin=dict(t=50, l=25, r=25, b=25), height=1000, title="Number of Wins Per Region for Different Teams Across Various Possible Brackets")
fig.show()

In [67]:
# Visualize the frequency of wins for teams across different slots (Women dataset)
fig = px.sunburst(sub_grouped, path=['SLOT', 'TeamInfo'], values='numWins',
                  color='numWins', hover_data=['SLOT'])

fig.update_layout(margin=dict(t=50, l=25, r=25, b=25), height=800, 
                  title="Frequency of Wins for Teams Across Different Slots (Women Dataset)")

fig.show()

In [68]:
# Visualize the frequency of wins for teams across different regions (Women dataset)
submission_W_merged['Region'] = [team[:1] for team in list(submission_W_merged['Team'].values)]
sub_grouped = submission_W_merged.groupby(['Region', 'TeamInfo'])['Team'].size().reset_index(name='numWins')

fig = px.sunburst(sub_grouped, path=['Region', 'TeamInfo'], values='numWins',
                  color='numWins', hover_data=['Region'])

fig.update_layout(margin=dict(t=50, l=25, r=25, b=25), height=800, 
                  title="Frequency of Wins for Teams Across Different Slots (Women Dataset)")

fig.show()

# Submission

In [69]:
submission = pd.concat([submission_M, submission_W]).rename_axis('RowId')
submission

Unnamed: 0_level_0,Tournament,Bracket,Slot,Team
RowId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,M,1,R1W1,W06
1,M,1,R1W2,W08
2,M,1,R1W3,W16
3,M,1,R1W4,W13
4,M,1,R1W5,W11
...,...,...,...,...
62995,W,1000,R4Y1,Y10
62996,W,1000,R4Z1,Z12
62997,W,1000,R5WX,Z12
62998,W,1000,R5YZ,W03


In [70]:
submission.to_csv('submission.csv', index=False)