# **<span style='color:#A80808'>🎯 Goal</span>**

Pick the winners and losers using a combination of rich historical data.

# **<span style='color:#A80808'>🔑 Metric</span>**

Submissions are scored on the log loss: $LogLoss = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\right]$

where:

* $n$ is the number of games played
* $\hat{y}_i$ is the predicted probability of team 1 beating team 2
* $y_i$ is 1 if team 1 wins, 0 if team 2 wins
* $\log$ is the natural logarithm

The use of the logarithm provides extreme punishments for being both confident and wrong. In the worst possible case, a prediction that something is true when it is actually false will add an infinite amount to your error score. In order to prevent this, predictions are bounded away from the extremes by a small value.
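To make the effect of clipping concrete, here is a minimal sketch of the metric. The function name `log_loss` and the clipping value `eps=1e-15` are assumptions chosen for illustration; the competition's exact bound is not stated above.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss with predictions clipped to [eps, 1 - eps].

    eps is an assumed value; the competition's actual bound may differ.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))

# A coin-flip prediction costs ln(2) per game.
print(log_loss([1, 0], [0.5, 0.5]))
# A confident wrong prediction is heavily penalised, but stays finite after clipping.
print(log_loss([1], [0.0]))
```

Without the clipping step, the second call would evaluate `log(0)` and return infinity.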

# **<span style='color:#A80808'>💾 Data</span>**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import random

from sklearn.ensemble import RandomForestRegressor as rfr

import warnings
warnings.simplefilter('ignore')

# **<span style='color:#A80808'>Data Section 1 file: WSeasons.csv</span>**

This file identifies the different seasons included in the historical data, along with certain season-level properties.

* Season - indicates the year in which the tournament was played. Remember that the current season counts as 2022.
* DayZero - tells you the date corresponding to DayNum=0 during that season. All game dates have been aligned upon a common scale so that (each year) Selection Monday is on day 133. All game data includes the day number in order to make it easier to perform date calculations. If you need to know the exact date a game was played on, you can combine the game's "DayNum" with the season's "DayZero". For instance, since day zero during the 2011-2012 season was 10/31/2011, if we know that the earliest regular season games that year were played on DayNum=7, they were therefore played on 11/07/2011.
* RegionW, RegionX, RegionY, RegionZ - by convention, the four regions in the final tournament are always named W, X, Y, and Z. Whichever region's name comes first alphabetically, that region will be Region W. And whichever Region plays against Region W in the national semifinals, that will be Region X. For the other two regions, whichever region's name comes first alphabetically, that region will be Region Y, and the other will be Region Z. This allows us to identify the regions and brackets in a standardized way in other files. For instance, during the 2012 tournament, the four regions were DesMoines, Fresno, Kingston, and Raleigh. Being the first alphabetically, DesMoines becomes W. Since the Fresno regional champion (Stanford) played against the DesMoines regional champion (Baylor) in the national semifinals, that makes Fresno region X. For the other two (Kingston and Raleigh), since Kingston comes first alphabetically, that makes Kingston Y and therefore Raleigh is Z. So for that season, the W/X/Y/Z are DesMoines,Fresno,Kingston,Raleigh. And so for instance, Baylor, the #1 seed in the DesMoines region, is listed in the WNCAATourneySeeds file with a seed of W01, meaning they were the #1 seed in the W region (the DesMoines region). We will not know the final W/X/Y/Z designations until Selection Monday, because the national semifinal pairings in the Final Four will depend upon the overall ranks of the four #1 seeds.

The game dates in this dataset are expressed in relative terms, as the number of days since the start of the regular season, and aligned for each season so that day number #133 is the Monday right before the tournament, when team selections are made. During any given season, day number zero is defined to be exactly 19 weeks earlier than Selection Monday, so Day #0 is a Monday in late October or early November such that Day #132 is Selection Sunday (for the men's tournament) and Day #133 is Selection Monday (for the women's tournament).

This doesn't necessarily mean that the regular season will always start exactly on day #0 or day #1; in fact, during the past decade, regular season games typically start being played on a Friday that is either Day #4 or Day #11, but further back there was more variety.
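The date arithmetic described above can be sketched as follows. The helper name `game_date` is hypothetical, and the `MM/DD/YYYY` format is assumed from the example dates quoted in the text.

```python
import pandas as pd

def game_date(day_zero, day_num):
    """Return the calendar date of a game given the season's DayZero and its DayNum."""
    start = pd.to_datetime(day_zero, format='%m/%d/%Y')
    return start + pd.to_timedelta(day_num, unit='D')

# 2011-2012 season: DayZero was 10/31/2011, so DayNum=7 falls on 11/07/2011.
print(game_date('10/31/2011', 7).date())
# DayNum=133 is exactly 19 weeks after DayZero, so it lands on the same weekday (a Monday).
print(game_date('10/31/2011', 133).day_name())
```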

In [None]:
WSeasons = pd.read_csv('../input/womens-march-mania-2022/WDataFiles_Stage1/WSeasons.csv')
WSeasons

# **<span style='color:#A80808'>Data Section 1 file: WRegularSeasonCompactResults.csv</span>**

This file identifies the game-by-game results for many seasons of historical data, starting with the 1998 season. For each season, the file includes all games played from DayNum 0 through 132. It is important to realize that the "Regular Season" games are simply defined to be all games played on DayNum=132 or earlier (DayNum=133 is Selection Monday). Thus a game played before Selection Monday will show up here whether it was a pre-season tournament, a non-conference game, a regular conference game, a conference tournament game, or whatever.

* Season - this is the year of the associated entry in WSeasons.csv (the year in which the final tournament occurs). For example, during the 2016 season, there were regular season games played between November 2015 and March 2016, and all of those games will show up with a Season of 2016.
* DayNum - this integer always ranges from 0 to 132, and tells you what day the game was played on. It represents an offset from the "DayZero" date in the "WSeasons.csv" file. For example, the first game in the file was DayNum=18. Combined with the fact from the "WSeasons.csv" file that day zero was 10/27/1997 that year, this means the first game was played 18 days later, or 11/14/1997. There are no teams that ever played more than one game on a given date, so you can use this fact if you need a unique key (combining Season and DayNum and WTeamID).
* WTeamID - this identifies the id number of the team that won the game, as listed in the "WTeams.csv" file. No matter whether the game was won by the home team or visiting team, or if it was a neutral-site game, the "WTeamID" always identifies the winning team.
* WScore - this identifies the number of points scored by the winning team.
* LTeamID - this identifies the id number of the team that lost the game.
* LScore - this identifies the number of points scored by the losing team. Thus you can be confident that WScore will be greater than LScore for all games listed.
* NumOT - this indicates the number of overtime periods in the game, an integer 0 or higher.
* WLoc - this identifies the "location" of the winning team. If the winning team was the home team, this value will be "H". If the winning team was the visiting team, this value will be "A". If it was played on a neutral court, then this value will be "N". Sometimes it is unclear whether the site should be considered neutral, since it is near one team's home court, or even on their court during a tournament, but for this determination we have simply used the Kenneth Massey data in its current state, where the "@" sign is either listed with the winning team, the losing team, or neither team. If you would like to investigate this factor more closely, we invite you to explore Data Section 3, which provides the city that each game was played in, irrespective of whether it was considered to be a neutral site.
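The properties stated in the column descriptions (a unique key on Season, DayNum and WTeamID; WScore above LScore; DayNum between 0 and 132; WLoc in H/A/N) can be checked directly. The sketch below uses a small made-up frame in the documented layout; the team IDs and scores are invented for illustration, and the same checks could be run on the real file.

```python
import pandas as pd

# Toy rows in the documented column layout (all values are made up).
games = pd.DataFrame({
    'Season':  [1998, 1998, 1998],
    'DayNum':  [18, 18, 19],
    'WTeamID': [3104, 3163, 3104],
    'WScore':  [91, 87, 70],
    'LTeamID': [3422, 3221, 3163],
    'LScore':  [41, 76, 65],
    'NumOT':   [0, 0, 1],
    'WLoc':    ['H', 'H', 'N'],
})

# No team plays twice on the same day, so this triple should identify a game.
key = ['Season', 'DayNum', 'WTeamID']
key_is_unique = not games.duplicated(subset=key).any()

# The winner must always have scored more points than the loser.
winner_always_outscored = bool((games['WScore'] > games['LScore']).all())

# Regular season games fall on DayNum 0..132 and WLoc is one of H, A, N.
daynum_in_range = bool(games['DayNum'].between(0, 132).all())
valid_locations = bool(games['WLoc'].isin(['H', 'A', 'N']).all())

print(key_is_unique, winner_always_outscored, daynum_in_range, valid_locations)
```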

In [None]:
WRegularSeasonCompactResults =pd.read_csv('../input/womens-march-mania-2022/WDataFiles_Stage1/WRegularSeasonCompactResults.csv')
WRegularSeasonCompactResults

Look up each season's DayZero in WSeasons, then add DayNum to it to get the calendar date of every game.

In [None]:
WRegularSeasonCompactResults['Time'] = pd.to_datetime(WRegularSeasonCompactResults['Season'].map(WSeasons.set_index('Season')['DayZero']))
WRegularSeasonCompactResults['Time'] += pd.to_timedelta(WRegularSeasonCompactResults.DayNum, unit='D') 
WRegularSeasonCompactResults.head(2)

## Winner teams

In [None]:
print(f'There are {WRegularSeasonCompactResults.WTeamID.nunique()} unique winner IDs:')
print(f'{np.sort(WRegularSeasonCompactResults.WTeamID.unique())}')

In [None]:
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.groupby('WTeamID').size().hist(bins=100, color='orange')
plt.xlabel('Number of wins per team')
plt.ylabel('Frequency')
plt.show()

## Loser teams

In [None]:
print(f'There are {WRegularSeasonCompactResults.LTeamID.nunique()} unique loser IDs:')
print(f'{np.sort(WRegularSeasonCompactResults.LTeamID.unique())}')

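The number of unique winners equals the number of unique losers. This suggests that no team always wins or always loses, although equal counts alone do not prove it: the two sets of IDs would also have to match.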
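A stricter check compares the sets of IDs rather than their sizes. The sketch below uses a hypothetical helper `win_loss_coverage` on made-up rows where the counts match but the sets do not.

```python
import pandas as pd

def win_loss_coverage(games):
    """Return the teams that never lost and the teams that never won."""
    winners = set(games['WTeamID'])
    losers = set(games['LTeamID'])
    return {'never_lost': winners - losers, 'never_won': losers - winners}

# Made-up example: two unique winners and two unique losers, yet the sets differ.
toy = pd.DataFrame({'WTeamID': [1, 2], 'LTeamID': [3, 1]})
print(win_loss_coverage(toy))
```

If both returned sets are empty on the real data, every team has at least one win and one loss.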

In [None]:
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.groupby('LTeamID').size().hist(bins=100, color='orange')
plt.xlabel('Number of losses per team')
plt.ylabel('Frequency')
plt.show()

The distribution of the number of losses per team is more skewed than the distribution of the number of wins per team.

## Winner score

In [None]:
print(f'There are {WRegularSeasonCompactResults.WScore.nunique()} unique winner scores:')
print(f'{np.sort(WRegularSeasonCompactResults.WScore.unique())}')

In [None]:
print(f'Mean and std of winner scores: mean={np.round(WRegularSeasonCompactResults.WScore.mean(), 2)}, std={np.round(WRegularSeasonCompactResults.WScore.std(), 2)}')
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.WScore.hist(bins=100, color='orange')
plt.xlabel('Winner score')
plt.ylabel('Frequency')
plt.show()

## Loser score

In [None]:
print(f'There are {WRegularSeasonCompactResults.LScore.nunique()} unique loser scores:')
print(f'{np.sort(WRegularSeasonCompactResults.LScore.unique())}')

In [None]:
print(f'Mean and std of loser scores: mean={np.round(WRegularSeasonCompactResults.LScore.mean(), 2)}, std={np.round(WRegularSeasonCompactResults.LScore.std(), 2)}')
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.LScore.hist(bins=100, color='orange')
plt.xlabel('Loser score')
plt.ylabel('Frequency')
plt.show()

## Difference between winner and loser scores

In [None]:
WRegularSeasonCompactResults['DScore'] = WRegularSeasonCompactResults.WScore - WRegularSeasonCompactResults.LScore

In [None]:
print(f'There are {WRegularSeasonCompactResults.DScore.nunique()} unique score differences:')
print(f'{np.sort(WRegularSeasonCompactResults.DScore.unique())}')

In [None]:
print(f'Mean and std of score differences: mean={np.round(WRegularSeasonCompactResults.DScore.mean(), 2)}, std={np.round(WRegularSeasonCompactResults.DScore.std(), 2)}')
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.DScore.hist(bins=100, color='orange')
plt.xlabel('Score difference')
plt.ylabel('Frequency')
plt.show()

Unlike the roughly bell-shaped winner and loser score distributions observed above, the score difference is skewed. It is bounded below by 1, since the winner always outscores the loser.

## Overtime

In [None]:
print(f'There are {WRegularSeasonCompactResults.NumOT.nunique()} unique overtime counts:')
print(f'{np.sort(WRegularSeasonCompactResults.NumOT.unique())}')

In [None]:
pie, ax = plt.subplots(figsize=[20,12])
WRegularSeasonCompactResults.groupby('NumOT').size().plot(kind='pie',
                                                    #autopct='%.2f',
                                                    ax=ax,
                                                    title='Overtime distribution',
                                                    rotatelabels =False,
                                                    cmap = 'tab10')
plt.show()

## Winner location

In [None]:
print(f'There are {WRegularSeasonCompactResults.WLoc.nunique()} unique winner locations:')
print(f'{np.sort(WRegularSeasonCompactResults.WLoc.unique())}')

In [None]:
pie, ax = plt.subplots(figsize=[20,12])
WRegularSeasonCompactResults.groupby('WLoc').size().plot(kind='pie',
                                                    #autopct='%.2f',
                                                    ax=ax,
                                                    title='Winner location distribution',
                                                    rotatelabels =False,
                                                    cmap = 'tab10')
plt.show()

Winners were most often the home team, which points to a clear home-court advantage.
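The pie chart can be backed with numbers. The sketch below computes the share of each `WLoc` value and the home win rate among non-neutral games, using a made-up series in place of the real column.

```python
import pandas as pd

# Made-up WLoc values standing in for WRegularSeasonCompactResults.WLoc.
wloc = pd.Series(['H', 'H', 'A', 'N', 'H'], name='WLoc')

# Share of games won at home, away, and on a neutral court.
shares = wloc.value_counts(normalize=True)

# Among games with a real home team, how often did the home team win?
non_neutral = wloc[wloc != 'N']
home_win_rate = (non_neutral == 'H').mean()

print(shares)
print(f'Home win rate (non-neutral games): {home_win_rate:.2f}')
```

Excluding neutral-site games matters because they have no home team, so they would dilute the comparison between home and away wins.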

# **<span style='color:#A80808'>Data Section 1 file: WNCAATourneyCompactResults.csv</span>**

This file identifies the game-by-game NCAA® tournament results for all seasons of historical data. The data is formatted exactly like the WRegularSeasonCompactResults data. Each season you will see 63 games listed, since there are no women's play-in games.

Although the scheduling of the men's tournament rounds has been consistent for many years, there has been more variety in the scheduling of the women's rounds. There have been four different schedules over the course of the past 20+ years for the women's tournament, as follows:

2017 season through 2021 season:

* Round 1 = days 137/138 (Fri/Sat)
* Round 2 = days 139/140 (Sun/Mon)
* Round 3 = days 144/145 (Sweet Sixteen, Fri/Sat)
* Round 4 = days 146/147 (Elite Eight, Sun/Mon)
* National Semifinal = day 151 (Fri)
* National Final = day 153 (Sun)

2015 season and 2016 season:

* Round 1 = days 137/138 (Fri/Sat)
* Round 2 = days 139/140 (Sun/Mon)
* Round 3 = days 144/145 (Sweet Sixteen, Fri/Sat)
* Round 4 = days 146/147 (Elite Eight, Sun/Mon)
* National Semifinal = day 153 (Sun)
* National Final = day 155 (Tue)

2003 season through 2014 season:

* Round 1 = days 138/139 (Sat/Sun)
* Round 2 = days 140/141 (Mon/Tue)
* Round 3 = days 145/146 (Sweet Sixteen, Sat/Sun)
* Round 4 = days 147/148 (Elite Eight, Mon/Tue)
* National Semifinal = day 153 (Sun)
* National Final = day 155 (Tue)

1998 season through 2002 season:

* Round 1 = days 137/138 (Fri/Sat)
* Round 2 = days 139/140 (Sun/Mon)
* Round 3 = day 145 only (Sweet Sixteen, Sat)
* Round 4 = day 147 only (Elite Eight, Mon)
* National Semifinal = day 151 (Fri)
* National Final = day 153 (Sun)
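The four schedules above can be encoded as a lookup from (Season, DayNum) to round number. This is a sketch only: the names `SCHEDULES` and `tourney_round` are hypothetical, and numbering the national semifinal as round 5 and the final as round 6 is an assumption that extends the Round 1 to Round 4 labels used in the lists.

```python
# Each entry: (first season, last season, {DayNum: round number}).
# Rounds 5 and 6 (semifinal, final) are an assumed numbering.
SCHEDULES = [
    (2017, 2021, {137: 1, 138: 1, 139: 2, 140: 2, 144: 3, 145: 3,
                  146: 4, 147: 4, 151: 5, 153: 6}),
    (2015, 2016, {137: 1, 138: 1, 139: 2, 140: 2, 144: 3, 145: 3,
                  146: 4, 147: 4, 153: 5, 155: 6}),
    (2003, 2014, {138: 1, 139: 1, 140: 2, 141: 2, 145: 3, 146: 3,
                  147: 4, 148: 4, 153: 5, 155: 6}),
    (1998, 2002, {137: 1, 138: 1, 139: 2, 140: 2, 145: 3,
                  147: 4, 151: 5, 153: 6}),
]

def tourney_round(season, day_num):
    """Return the round number for a tournament game, or None if not covered."""
    for first, last, days in SCHEDULES:
        if first <= season <= last:
            return days.get(day_num)
    return None

print(tourney_round(2019, 151))  # national semifinal under the 2017-2021 schedule
print(tourney_round(2010, 148))  # Elite Eight under the 2003-2014 schedule
```

Applied row by row, for example with `df.apply(lambda r: tourney_round(r.Season, r.DayNum), axis=1)`, this would add a round column to the tournament results.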

In [None]:
WNCAATourneyCompactResults =pd.read_csv('../input/womens-march-mania-2022/WDataFiles_Stage1/WNCAATourneyCompactResults.csv')
WNCAATourneyCompactResults

## Winner teams

In [None]:
print(f'There are {WNCAATourneyCompactResults.WTeamID.nunique()} unique winner IDs:')
print(f'{np.sort(WNCAATourneyCompactResults.WTeamID.unique())}')

In [None]:
plt.figure(figsize=(10,7))
WNCAATourneyCompactResults.groupby('WTeamID').size().hist(bins=100, color='orange')
plt.xlabel('Number of wins per team')
plt.ylabel('Frequency')
plt.show()

## Loser teams

In [None]:
print(f'There are {WNCAATourneyCompactResults.LTeamID.nunique()} unique loser IDs:')
print(f'{np.sort(WNCAATourneyCompactResults.LTeamID.unique())}')

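If the number of unique winners equals the number of unique losers, that suggests no team always wins or always loses, although equal counts alone do not prove it: the two sets of IDs would also have to match.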

In [None]:
plt.figure(figsize=(10,7))
WNCAATourneyCompactResults.groupby('LTeamID').size().hist(bins=100, color='orange')
plt.xlabel('Number of losses per team')
plt.ylabel('Frequency')
plt.show()

The distribution of the number of losses per team is more skewed than the distribution of the number of wins per team.

## Winner score

In [None]:
print(f'There are {WNCAATourneyCompactResults.WScore.nunique()} unique winner scores:')
print(f'{np.sort(WNCAATourneyCompactResults.WScore.unique())}')

In [None]:
print(f'Mean and std of winner scores: mean={np.round(WNCAATourneyCompactResults.WScore.mean(), 2)}, std={np.round(WNCAATourneyCompactResults.WScore.std(), 2)}')
plt.figure(figsize=(10,7))
WNCAATourneyCompactResults.WScore.hist(bins=100, color='orange')
plt.xlabel('Winner score')
plt.ylabel('Frequency')
plt.show()

## Loser score

In [None]:
print(f'There are {WNCAATourneyCompactResults.LScore.nunique()} unique loser scores:')
print(f'{np.sort(WNCAATourneyCompactResults.LScore.unique())}')

In [None]:
print(f'Mean and std of loser scores: mean={np.round(WNCAATourneyCompactResults.LScore.mean(), 2)}, std={np.round(WNCAATourneyCompactResults.LScore.std(), 2)}')
plt.figure(figsize=(10,7))
WNCAATourneyCompactResults.LScore.hist(bins=100, color='orange')
plt.xlabel('Loser score')
plt.ylabel('Frequency')
plt.show()

## Difference between winner and loser scores

In [None]:
WNCAATourneyCompactResults['DScore'] = WNCAATourneyCompactResults.WScore - WNCAATourneyCompactResults.LScore

In [None]:
print(f'There are {WNCAATourneyCompactResults.DScore.nunique()} unique score differences:')
print(f'{np.sort(WNCAATourneyCompactResults.DScore.unique())}')

In [None]:
print(f'Mean and std of score differences: mean={np.round(WNCAATourneyCompactResults.DScore.mean(), 2)}, std={np.round(WNCAATourneyCompactResults.DScore.std(), 2)}')
plt.figure(figsize=(10,7))
WNCAATourneyCompactResults.DScore.hist(bins=100, color='orange')
plt.xlabel('Score difference')
plt.ylabel('Frequency')
plt.show()

Unlike the roughly bell-shaped winner and loser score distributions observed above, the score difference is skewed. It is bounded below by 1, since the winner always outscores the loser.

## Overtime

In [None]:
print(f'There are {WNCAATourneyCompactResults.NumOT.nunique()} unique overtime counts:')
print(f'{np.sort(WNCAATourneyCompactResults.NumOT.unique())}')

In [None]:
pie, ax = plt.subplots(figsize=[20,12])
WNCAATourneyCompactResults.groupby('NumOT').size().plot(kind='pie',
                                                    #autopct='%.2f',
                                                    ax=ax,
                                                    title='Overtime distribution',
                                                    rotatelabels =False,
                                                    cmap = 'tab10')
plt.show()

## Winner location

In [None]:
print(f'There are {WNCAATourneyCompactResults.WLoc.nunique()} unique winner locations:')
print(f'{np.sort(WNCAATourneyCompactResults.WLoc.unique())}')

In [None]:
pie, ax = plt.subplots(figsize=[20,12])
WNCAATourneyCompactResults.groupby('WLoc').size().plot(kind='pie',
                                                    #autopct='%.2f',
                                                    ax=ax,
                                                    title='Winner location distribution',
                                                    rotatelabels =False,
                                                    cmap = 'tab10')
plt.show()

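As in the regular season, home wins outnumber away wins, which points to a home-court advantage in the tournament games that are not played on a neutral court.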

# **<span style='color:#A80808'>🚀 Model</span>**

In [None]:
# Build the training set so that Team1 is always the lower team ID.
results = WRegularSeasonCompactResults
swap = results.WTeamID > results.LTeamID  # True where the winner has the higher ID

train = pd.DataFrame()
train['Season'] = results.Season
# np.where avoids chained assignment, which pandas may apply to a copy.
train['Team1'] = np.where(swap, results.LTeamID, results.WTeamID)
train['Team2'] = np.where(swap, results.WTeamID, results.LTeamID)

# Target: Team1's share of the total points (above 0.5 when Team1 won).
win_share = results.WScore / (results.WScore + results.LScore)
train['target'] = np.where(swap, 1 - win_share, win_share)
train.head()

In [None]:
# Verify that the first team ID is always smaller than the second team ID
(train.Team1 < train.Team2).unique()

In [None]:
model = rfr(n_estimators=100)

X = train[['Season', 'Team1', 'Team2']]
y = train.target
model.fit(X, y)

In [None]:
submission = pd.read_csv('../input/womens-march-mania-2022/WDataFiles_Stage1/WSampleSubmissionStage1.csv')

test = pd.DataFrame()
test['Season'] = submission.ID.apply(lambda x: int(x.split('_')[0]))
test['Team1'] = submission.ID.apply(lambda x: int(x.split('_')[1]))
test['Team2'] = submission.ID.apply(lambda x: int(x.split('_')[2]))
test.head()

In [None]:
# Verify that the first team ID is always smaller than the second team ID
(test.Team1 < test.Team2).unique()
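As an alternative to three separate `apply` calls, the ID string can be split once with pandas string methods. This is a sketch on made-up IDs in the `Season_Team1_Team2` layout implied by the parsing code above; the specific ID values are invented for illustration.

```python
import pandas as pd

# Made-up IDs in the Season_Team1_Team2 layout.
ids = pd.Series(['2016_3106_3107', '2016_3106_3113'], name='ID')

# Split on '_' into three columns and convert them all to integers.
parts = ids.str.split('_', expand=True).astype(int)
parts.columns = ['Season', 'Team1', 'Team2']

print(parts)
```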

# **<span style='color:#A80808'>🏆 Submission</span>**

In [None]:
submission['Pred'] = model.predict(test)
submission.to_csv('submission.csv', index=False)
submission.head()

# This notebook is under construction 🏗