#### What are you trying to do in this notebook?
- We should submit predicted probabilities for every possible matchup in the past 5 NCAA® tournaments (2016-2019 and 2021). Note that there was no tournament held in 2020.
- We should submit predicted probabilities for every possible matchup before the 2022 tournament begins.
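
As a rough illustration of what "every possible matchup" means, the sketch below enumerates the unordered team pairs for one season and formats them as `Season_Team1_Team2` with the lower TeamID first, which is how the submission IDs are split later in this notebook. The helper name `all_matchup_ids` and the team IDs are made up for illustration.

```python
from itertools import combinations

def all_matchup_ids(season, team_ids):
    """Return one ID per unordered pair of teams, lower TeamID first."""
    ordered = sorted(set(team_ids))
    return [f"{season}_{a}_{b}" for a, b in combinations(ordered, 2)]

# Hypothetical team IDs, not taken from the competition files.
example_ids = all_matchup_ids(2021, [3104, 3101, 3103])
print(example_ids)
```

With n teams this gives n * (n - 1) / 2 rows per season, so a 64-team field yields 2016 matchups.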

For each team in each season, I compute:

- Number of wins
- Number of losses
- Average score gap of wins
- Average score gap of losses

I then use the following features:

- Win Ratio
- Average score gap
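
The per-team statistics listed above could be computed along these lines. This is a minimal sketch on invented rows, assuming the compact-results column names (`Season`, `WTeamID`, `LTeamID`, `WScore`, `LScore`). Reading "average score gap" as a signed margin (positive for wins, negative for losses) is my assumption, and the names `stats`, `WinRatio` and `GapAvg` are illustrative.

```python
import pandas as pd

# Toy results table using the compact-results column names; the rows are invented.
results = pd.DataFrame({
    "Season":  [2021, 2021, 2021, 2021],
    "WTeamID": [3101, 3101, 3102, 3103],
    "LTeamID": [3102, 3103, 3103, 3101],
    "WScore":  [70, 65, 80, 60],
    "LScore":  [60, 60, 70, 58],
})
results["Gap"] = results["WScore"] - results["LScore"]

# Number of wins and average winning margin per team and season.
wins = (results.groupby(["Season", "WTeamID"])["Gap"]
        .agg(NumWins="count", GapWins="mean")
        .rename_axis(["Season", "TeamID"]))
# Number of losses and average losing margin per team and season.
losses = (results.groupby(["Season", "LTeamID"])["Gap"]
          .agg(NumLosses="count", GapLosses="mean")
          .rename_axis(["Season", "TeamID"]))

# Teams with no wins (or no losses) get zeros instead of missing values.
stats = wins.join(losses, how="outer").fillna(0).reset_index()
games_played = stats["NumWins"] + stats["NumLosses"]
stats["WinRatio"] = stats["NumWins"] / games_played
# Signed average margin: wins count as positive, losses as negative.
stats["GapAvg"] = (stats["NumWins"] * stats["GapWins"]
                   - stats["NumLosses"] * stats["GapLosses"]) / games_played
print(stats)
```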

#### Why are you trying it?
The seeds file identifies the seeds for all teams in each NCAA® tournament, for all seasons of historical data. Thus, there are between 64 and 68 rows for each year, depending on whether there were any play-in games and how many there were. In recent years the structure has settled at 68 total teams, with four "play-in" games leading to the final field of 64 teams entering Round 1 on Thursday of the first week (by definition, that is DayNum=136 each season). We will not know the seeds of the respective tournament teams, or even exactly which 68 teams it will be, until Selection Sunday on March 15, 2020 (DayNum=132).

The seed is a 3- or 4-character string:

- First character: Region (W, X, Y, or Z)
- Next two digits: Seed within the region (01 to 16)
- Last character (optional): Distinguishes teams in play-in games (a or b)
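
The layout above can be unpacked with plain string slicing. The sketch below uses a hypothetical helper, `parse_seed`, that returns all three parts. The notebook itself keeps only the numeric part, via `int(v[1:3])`.

```python
def parse_seed(seed):
    """Split a seed string such as 'W01' or 'X16a' into its parts."""
    region = seed[0]
    number = int(seed[1:3])
    play_in = seed[3:] or None  # 'a' or 'b' when present, otherwise None
    return region, number, play_in

print(parse_seed("W01"))
print(parse_seed("X16a"))
```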

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
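import numpy as np
import pandas as pd
from sklearn import linear_model, metrics  # only the submodules used below
import glob

# Load every competition CSV into a dict keyed by file name
f = {path.split('/')[-1]: pd.read_csv(path, encoding='latin1') for path in glob.glob('/*/*/w*-m*2022/WD*1/**')}
#for df in f: print(df, list(f[df].columns))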

In [None]:
teams = f['WTeams.csv']
teams2 = f['WTeamSpellings.csv']
season_cresults = f['WRegularSeasonCompactResults.csv']
season_dresults = f['WRegularSeasonDetailedResults.csv']
tourney_cresults = f['WNCAATourneyCompactResults.csv']
tourney_dresults = f['WNCAATourneyDetailedResults.csv']
slots = f['WNCAATourneySlots.csv']
seeds = f['WNCAATourneySeeds.csv']
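# Map 'Season_TeamID' -> numeric seed (characters 1-2 of the seed string, e.g. 'W01' -> 1)
seeds = {'_'.join(map(str,[int(k1),k2])):int(v[1:3]) for k1, v, k2 in seeds[['Season', 'Seed', 'TeamID']].values}
# Reuse the 2021 seeds for 2022 as a placeholder
seeds = {**seeds, **{k.replace('2021_','2022_'):seeds[k] for k in seeds if '2021_' in k}}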
cities = f['Cities.csv']
gcities = f['WGameCities.csv']
seasons = f['WSeasons.csv']
sub = f['WSampleSubmissionStage1.csv']

In [None]:
teams2 = teams2.groupby(by='TeamID', as_index=False)['TeamNameSpelling'].count()
teams2.columns = ['TeamID', 'TeamNameCount']
teams = pd.merge(teams, teams2, how='left', on=['TeamID'])
del teams2

In [None]:
season_cresults['ST'] = 'S'
season_dresults['ST'] = 'S'
tourney_cresults['ST'] = 'T'
tourney_dresults['ST'] = 'T'
#games = pd.concat((season_cresults, tourney_cresults), axis=0, ignore_index=True)
games = pd.concat((season_dresults, tourney_dresults), axis=0, ignore_index=True)
games.reset_index(drop=True, inplace=True)
games['WLoc'] = games['WLoc'].map({'A': 1, 'H': 2, 'N': 3})

In [None]:
games['ID'] = games.apply(lambda r: '_'.join(map(str, [r['Season']]+sorted([r['WTeamID'],r['LTeamID']]))), axis=1)
games['IDTeams'] = games.apply(lambda r: '_'.join(map(str, sorted([r['WTeamID'],r['LTeamID']]))), axis=1)
games['Team1'] = games.apply(lambda r: sorted([r['WTeamID'],r['LTeamID']])[0], axis=1)
games['Team2'] = games.apply(lambda r: sorted([r['WTeamID'],r['LTeamID']])[1], axis=1)
games['IDTeam1'] = games.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team1']])), axis=1)
games['IDTeam2'] = games.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team2']])), axis=1)

In [None]:
games['Team1Seed'] = games['IDTeam1'].map(seeds).fillna(0)
games['Team2Seed'] = games['IDTeam2'].map(seeds).fillna(0)

In [None]:
games['ScoreDiff'] = games['WScore'] - games['LScore']
# Target: 1 if the team with the lower TeamID (Team1) won, else 0
games['Pred'] = games.apply(lambda r: 1. if sorted([r['WTeamID'],r['LTeamID']])[0]==r['WTeamID'] else 0., axis=1)
# Score difference from Team1's point of view (negative when Team1 lost)
games['ScoreDiffNorm'] = games.apply(lambda r: r['ScoreDiff'] * -1 if r['Pred'] == 0. else r['ScoreDiff'], axis=1)
games['SeedDiff'] = games['Team1Seed'] - games['Team2Seed']
games = games.fillna(-1)

In [None]:
c_score_col = ['NumOT', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3', 'WFTM', 'WFTA', 'WOR', 'WDR', 'WAst', 'WTO', 'WStl',
 'WBlk', 'WPF', 'LFGM', 'LFGA', 'LFGM3', 'LFGA3', 'LFTM', 'LFTA', 'LOR', 'LDR', 'LAst', 'LTO', 'LStl',
 'LBlk', 'LPF']
c_score_agg = ['sum', 'mean', 'median', 'max', 'min', 'std', 'skew', 'nunique']
gb = games.groupby(by=['IDTeams']).agg({k: c_score_agg for k in c_score_col}).reset_index()
gb.columns = [''.join(c) + '_c_score' for c in gb.columns]

# Train on tournament games only; the aggregates above were built from all games
games = games[games['ST']=='T']

In [None]:
sub['WLoc'] = 3
sub['Season'] = sub['ID'].map(lambda x: x.split('_')[0]).astype(int)
sub['Team1'] = sub['ID'].map(lambda x: x.split('_')[1])
sub['Team2'] = sub['ID'].map(lambda x: x.split('_')[2])
sub['IDTeams'] = sub.apply(lambda r: '_'.join(map(str, [r['Team1'], r['Team2']])), axis=1)
sub['IDTeam1'] = sub.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team1']])), axis=1)
sub['IDTeam2'] = sub.apply(lambda r: '_'.join(map(str, [r['Season'], r['Team2']])), axis=1)
sub['Team1Seed'] = sub['IDTeam1'].map(seeds).fillna(0)
sub['Team2Seed'] = sub['IDTeam2'].map(seeds).fillna(0)
sub['SeedDiff'] = sub['Team1Seed'] - sub['Team2Seed'] 
sub = sub.fillna(-1)

In [None]:
games = pd.merge(games, gb, how='left', left_on='IDTeams', right_on='IDTeams_c_score')
sub = pd.merge(sub, gb, how='left', left_on='IDTeams', right_on='IDTeams_c_score')

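# Feature columns: drop identifiers, targets, raw box-score columns, and the string merge key 'IDTeams_c_score'
col = [c for c in games.columns if c not in ['ID', 'DayNum', 'ST', 'Team1', 'Team2', 'IDTeams', 'IDTeams_c_score', 'IDTeam1', 'IDTeam2', 'WTeamID', 'WScore', 'LTeamID', 'LScore', 'NumOT', 'Pred', 'ScoreDiff', 'ScoreDiffNorm', 'WLoc'] + c_score_col]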

In [None]:
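# Fit a plain linear regression on the 0/1 target; its output is clipped so it can be read as a probability
reg = linear_model.LinearRegression()
reg.fit(games[col].fillna(-1), games['Pred'])
pred = reg.predict(games[col].fillna(-1)).clip(0,1)
# In-sample log loss (computed on the training games themselves), so it is optimistic
print('Log Loss:', metrics.log_loss(games['Pred'], pred))
# Keep predictions away from exactly 0 and 1 so a wrong pick cannot give an unbounded log loss
sub['Pred'] = reg.predict(sub[col].fillna(-1)).clip(0.000002,0.999998)
sub[['ID', 'Pred']].to_csv('submission.csv', index=False)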

#### Did it work?
The file identifies the game-by-game results for many seasons of historical data, starting with the 1985 season (the first year the NCAA® had a 64-team tournament). For each season, the file includes all games played from DayNum 0 through 132. It is important to realize that the "Regular Season" games are simply defined to be all games played on DayNum=132 or earlier (DayNum=132 is Selection Sunday, and there are always a few conference tournament finals actually played early in the day on Selection Sunday itself). Thus a game played on or before Selection Sunday will show up here whether it was a pre-season tournament, a non-conference game, a regular conference game, a conference tournament game, or whatever.
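
The DayNum rule described above amounts to a simple filter. The sketch below applies it to a few invented rows. The constant name `SELECTION_SUNDAY` and the toy frame are illustrative; only the cutoff of 132 comes from the description.

```python
import pandas as pd

SELECTION_SUNDAY = 132  # DayNum cutoff quoted in the description above

# Invented rows, only to show the filter.
toy_games = pd.DataFrame({
    "Season":  [2021, 2021, 2021],
    "DayNum":  [10, 132, 136],
    "WTeamID": [3101, 3102, 3101],
    "LTeamID": [3102, 3103, 3103],
})

# "Regular season" means every game played on or before Selection Sunday.
is_regular = toy_games["DayNum"] <= SELECTION_SUNDAY
regular_season = toy_games[is_regular]
post_selection = toy_games[~is_regular]
print(len(regular_season), len(post_selection))
```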

#### What did you not understand about this process?
Well, everything is understandable.

Risky strategy:
- Picked 11 teams that would win their first match
- Stanford and Baylor beat every team seeded 3 or higher
- Connecticut and South Carolina beat every team seeded 4 or higher
- Maryland beats every team seeded 7 or higher
- use p=0.99999 for overriding

Safe strategy:
- Picked 7 teams that would win their first match
- Stanford, Connecticut and South Carolina beat every team seeded 6 or higher
- Baylor beats every team seeded 7 or higher
- use p=0.99 for overriding

*This part needs to be updated using this year's best teams. I'm waiting to get analysts' insights for this. Also, this was for the women's teams, so don't be this aggressive with the men's competition, as there usually are more upsets.*
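
One way such overrides could be applied to a submission is sketched below. It is not the code used for the strategies above: the helper `apply_overrides`, the toy IDs and the toy seeds are hypothetical, and "seeded N or higher" is read as "seed number greater than or equal to N", which is an assumption. It relies on `Pred` being the probability that the lower-ID team (Team1) wins, and on a `seeds` dict keyed by `Season_TeamID`, as built earlier in this notebook.

```python
import pandas as pd

def apply_overrides(sub, seeds, favorites, min_seed, p):
    """Set Pred to p (or 1 - p) when a favorite meets an opponent seeded min_seed or higher."""
    out = sub.copy()
    for i, game_id in out["ID"].items():
        season, t1, t2 = game_id.split("_")
        s1 = seeds.get(f"{season}_{t1}")
        s2 = seeds.get(f"{season}_{t2}")
        if s1 is None or s2 is None:
            continue  # at least one team has no seed: keep the model output
        t1_fav = int(t1) in favorites and s2 >= min_seed
        t2_fav = int(t2) in favorites and s1 >= min_seed
        if t1_fav and not t2_fav:
            out.at[i, "Pred"] = p        # Pred is the probability that Team1 wins
        elif t2_fav and not t1_fav:
            out.at[i, "Pred"] = 1.0 - p
    return out

# Hypothetical IDs and seeds, only for illustration.
toy_seeds = {"2021_3101": 1, "2021_3102": 8, "2021_3103": 2}
toy_sub = pd.DataFrame({
    "ID": ["2021_3101_3102", "2021_3101_3103", "2021_3102_3103"],
    "Pred": [0.5, 0.5, 0.5],
})
overridden = apply_overrides(toy_sub, toy_seeds, favorites={3101}, min_seed=3, p=0.99)
print(overridden)
```

Under log loss, a wrong override at p=0.99999 costs far more than one at p=0.99, which is why the first strategy is the risky one.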

#### What else do you think you can try as part of this approach?
The file identifies the game-by-game NCAA® tournament results for all seasons of historical data. The data is formatted exactly like the MRegularSeasonCompactResults data. All games will show up as neutral site (so WLoc is always N). Note that this tournament game data also includes the play-in games (which always occurred on day 134/135) for those years that had play-in games. Thus each season you will see between 63 and 67 games listed, depending on how many play-in games there were.
