# **<span style='color:#A80808'>🎯 Goal</span>**

Pick the winners and losers using a combination of rich historical data.

# **<span style='color:#A80808'>🔑 Metric</span>**

Submissions are scored on the log loss: $LogLoss = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i\log(\hat{y}_i)+(1-y_i)\log(1-\hat{y}_i)\right]$

where:

* $n$ is the number of games played
* $\hat{y}_i$ is the predicted probability of team 1 beating team 2
* $y_i$ is 1 if team 1 wins, 0 if team 2 wins
* $\log$ is the natural logarithm

The use of the logarithm provides extreme punishments for being both confident and wrong. In the worst possible case, a prediction that something is true when it is actually false will add an infinite amount to your error score. In order to prevent this, predictions are bounded away from the extremes by a small value.
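To make the clipping concrete, here is a minimal sketch of the metric in NumPy. The function name `clipped_log_loss` and the bound `eps=1e-15` are assumptions chosen for illustration; the exact bound used by the competition scorer is not stated here.

```python
import numpy as np

def clipped_log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss with predictions clipped to [eps, 1 - eps].

    eps is an illustrative value, not the official competition bound.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clipping keeps log() finite when a prediction is exactly 0 or 1.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    losses = y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)
    return float(-np.mean(losses))

y = [1, 0, 1]
print(clipped_log_loss(y, [0.5, 0.5, 0.5]))  # constant 0.5 baseline, equals ln(2)
print(clipped_log_loss(y, [0.9, 0.1, 0.0]))  # one confident miss dominates the average
```

Predicting 0.5 everywhere gives a loss of $\ln 2 \approx 0.693$, which is a useful reference point: a model that scores worse than that is doing worse than not guessing at all.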

# **<span style='color:#A80808'>💾 Data</span>**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import random

# **<span style='color:#A80808'>Data Section 1 file: MRegularSeasonCompactResults.csv</span>**

This file identifies the game-by-game results for many seasons of historical data, starting with the 1998 season. For each season, the file includes all games played from DayNum 0 through 132. It is important to realize that the "Regular Season" games are simply defined to be all games played on DayNum=132 or earlier (DayNum=133 is Selection Monday). Thus a game played before Selection Monday will show up here whether it was a pre-season tournament, a non-conference game, a regular conference game, a conference tournament game, or whatever.

* Season - this is the year of the associated entry in WSeasons.csv (the year in which the final tournament occurs). For example, during the 2016 season, there were regular season games played between November 2015 and March 2016, and all of those games will show up with a Season of 2016.
* DayNum - this integer always ranges from 0 to 132, and tells you what day the game was played on. It represents an offset from the "DayZero" date in the "WSeasons.csv" file. For example, the first game in the file was DayNum=18. Combined with the fact from the "WSeasons.csv" file that day zero was 10/27/1997 that year, this means the first game was played 18 days later, or 11/14/1997. There are no teams that ever played more than one game on a given date, so you can use this fact if you need a unique key (combining Season and DayNum and WTeamID).
* WTeamID - this identifies the id number of the team that won the game, as listed in the "WTeams.csv" file. No matter whether the game was won by the home team or visiting team, or if it was a neutral-site game, the "WTeamID" always identifies the winning team.
* WScore - this identifies the number of points scored by the winning team.
* LTeamID - this identifies the id number of the team that lost the game.
* LScore - this identifies the number of points scored by the losing team. Thus you can be confident that WScore will be greater than LScore for all games listed.
* NumOT - this indicates the number of overtime periods in the game, an integer 0 or higher.
* WLoc - this identifies the "location" of the winning team. If the winning team was the home team, this value will be "H". If the winning team was the visiting team, this value will be "A". If it was played on a neutral court, then this value will be "N". Sometimes it is unclear whether the site should be considered neutral, since it is near one team's home court, or even on their court during a tournament, but for this determination we have simply used the Kenneth Massey data in its current state, where the "@" sign is either listed with the winning team, the losing team, or neither team. If you would like to investigate this factor more closely, we invite you to explore Data Section 3, which provides the city that each game was played in, irrespective of whether it was considered to be a neutral site.
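The column descriptions above imply a few invariants that are cheap to verify, and DayNum can be turned into a calendar date once the season's day zero is known. The sketch below uses a tiny hand-made frame: the team IDs, scores and rows are invented for illustration, and only the day-zero date (10/27/1997) and the DayNum=18 example come from the description. With the real files, `games` and `day_zero` would be read from the corresponding CSVs instead.

```python
import pandas as pd

# Toy rows with the same columns as the compact results file (values are made up).
games = pd.DataFrame({
    "Season": [1998, 1998, 1998],
    "DayNum": [18, 18, 25],
    "WTeamID": [3104, 3163, 3104],
    "WScore": [91, 87, 70],
    "LTeamID": [3202, 3221, 3163],
    "LScore": [41, 76, 68],
    "WLoc": ["H", "H", "N"],
    "NumOT": [0, 0, 1],
})
# Day zero per season, as described for the seasons file.
day_zero = pd.DataFrame({"Season": [1998],
                         "DayZero": pd.to_datetime(["1997-10-27"])})

# DayNum is an offset in days from the season's day zero.
games = games.merge(day_zero, on="Season", how="left")
games["GameDate"] = games["DayZero"] + pd.to_timedelta(games["DayNum"], unit="D")

# Invariants stated in the column descriptions.
checks = {
    "DayNum in [0, 132]": games["DayNum"].between(0, 132).all(),
    "WScore > LScore": (games["WScore"] > games["LScore"]).all(),
    "NumOT >= 0": (games["NumOT"] >= 0).all(),
    "WLoc in {H, A, N}": games["WLoc"].isin(["H", "A", "N"]).all(),
    "(Season, DayNum, WTeamID) is unique":
        not games.duplicated(["Season", "DayNum", "WTeamID"]).any(),
}
for name, ok in checks.items():
    print(f"{name}: {ok}")
print(games[["Season", "DayNum", "GameDate"]])
```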

In [None]:
WRegularSeasonCompactResults = pd.read_csv('../input/mens-march-mania-2022/MDataFiles_Stage1/MRegularSeasonCompactResults.csv')
WRegularSeasonCompactResults

## Winner teams

In [None]:
print(f'There are {WRegularSeasonCompactResults.WTeamID.nunique()} unique winner IDs:')
print(f'{np.sort(WRegularSeasonCompactResults.WTeamID.unique())}')

In [None]:
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.groupby('WTeamID').size().hist(bins=100, color='orange')
plt.xlabel('Number of games won per team')
plt.ylabel('Frequency')
plt.show()

## Loser teams

In [None]:
print(f'There are {WRegularSeasonCompactResults.LTeamID.nunique()} unique loser IDs:')
print(f'{np.sort(WRegularSeasonCompactResults.LTeamID.unique())}')

The number of unique winner IDs equals the number of unique loser IDs, which suggests that no team always wins or always loses.
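Equal counts alone do not strictly prove that the two sets of teams are identical, so a set comparison is a more direct check. The sketch below runs on a made-up three-game frame (the IDs are hypothetical) in which the counts match but one team never lost and another never won; on the real data, the same two set differences would be expected to come back empty if the observation above holds.

```python
import pandas as pd

# Toy frame: 3 unique winners and 3 unique losers, yet the sets differ.
toy = pd.DataFrame({"WTeamID": [1, 2, 3], "LTeamID": [2, 3, 4]})

winners = set(toy["WTeamID"].tolist())
losers = set(toy["LTeamID"].tolist())

never_lost = winners - losers   # teams that appear only as winners
never_won = losers - winners    # teams that appear only as losers

print("Same number of unique IDs:", len(winners) == len(losers))
print("Teams that never lost:", never_lost)
print("Teams that never won:", never_won)
```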

In [None]:
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.groupby('LTeamID').size().hist(bins=100, color='orange')
plt.xlabel('Number of games lost per team')
plt.ylabel('Frequency')
plt.show()

The distribution of the number of games lost per team is more skewed than the distribution of the number of games won per team.

## Winner score

In [None]:
print(f'There are {WRegularSeasonCompactResults.WScore.nunique()} unique winner scores:')
print(f'{np.sort(WRegularSeasonCompactResults.WScore.unique())}')

In [None]:
print(f'Mean and std of winner scores: mean={np.round(WRegularSeasonCompactResults.WScore.mean(), 2)}, std={np.round(WRegularSeasonCompactResults.WScore.std(), 2)}')
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.WScore.hist(bins=100, color='orange')
plt.xlabel('Winner score')
plt.ylabel('Frequency')
plt.show()

## Loser score

In [None]:
print(f'There are {WRegularSeasonCompactResults.LScore.nunique()} unique loser scores:')
print(f'{np.sort(WRegularSeasonCompactResults.LScore.unique())}')

In [None]:
print(f'Mean and std of loser scores: mean={np.round(WRegularSeasonCompactResults.LScore.mean(), 2)}, std={np.round(WRegularSeasonCompactResults.LScore.std(), 2)}')
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.LScore.hist(bins=100, color='orange')
plt.xlabel('Loser score')
plt.ylabel('Frequency')
plt.show()

## Difference between winner and loser scores

In [None]:
WRegularSeasonCompactResults['DScore'] = WRegularSeasonCompactResults.WScore - WRegularSeasonCompactResults.LScore

In [None]:
print(f'There are {WRegularSeasonCompactResults.DScore.nunique()} unique score differences:')
print(f'{np.sort(WRegularSeasonCompactResults.DScore.unique())}')

In [None]:
print(f'Mean and std of score differences: mean={np.round(WRegularSeasonCompactResults.DScore.mean(), 2)}, std={np.round(WRegularSeasonCompactResults.DScore.std(), 2)}')
plt.figure(figsize=(10,7))
WRegularSeasonCompactResults.DScore.hist(bins=100, color='orange')
plt.xlabel('Score difference')
plt.ylabel('Frequency')
plt.show()

Unlike the roughly bell-shaped distributions of the winner and loser scores observed previously, the distribution of the score difference is skewed: it is bounded below by 1, because WScore always exceeds LScore, and it has a long right tail.
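One way to see why the margin is skewed even when the individual scores are not is a toy simulation. This is a sketch under stated assumptions, not a model of the real data: each game draws two scores from the same normal distribution (the mean of 70 and standard deviation of 11 are arbitrary), the larger one is labelled the winner score, and the margin is their difference. Ties are ignored.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two independent, symmetric score draws per simulated game (arbitrary parameters).
a = rng.normal(loc=70, scale=11, size=10_000)
b = rng.normal(loc=70, scale=11, size=10_000)

sim = pd.DataFrame({
    "WScore": np.maximum(a, b),   # the higher score is the winner's
    "LScore": np.minimum(a, b),   # the lower score is the loser's
})
sim["DScore"] = sim["WScore"] - sim["LScore"]

# Sample skewness: values near 0 indicate symmetry, positive values a right tail.
print(sim[["WScore", "LScore", "DScore"]].skew().round(2))
```

Because the margin can never be negative, its mass piles up near zero and the tail extends only to the right, which is the same shape the histogram shows. On the real frame, `WRegularSeasonCompactResults[['WScore', 'LScore', 'DScore']].skew()` would give the corresponding numbers.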

## Overtime

In [None]:
print(f'There are {WRegularSeasonCompactResults.NumOT.nunique()} unique overtime counts:')
print(f'{np.sort(WRegularSeasonCompactResults.NumOT.unique())}')

In [None]:
pie, ax = plt.subplots(figsize=[20,12])
WRegularSeasonCompactResults.groupby('NumOT').size().plot(kind='pie',
                                                    #autopct='%.2f',
                                                    ax=ax,
                                                    title='Overtime distribution',
                                                    rotatelabels=False,
                                                    cmap='tab10')
plt.show()

## Winner location

In [None]:
print(f'There are {WRegularSeasonCompactResults.WLoc.nunique()} unique winner locations:')
print(f'{np.sort(WRegularSeasonCompactResults.WLoc.unique())}')

In [None]:
pie, ax = plt.subplots(figsize=[20,12])
WRegularSeasonCompactResults.groupby('WLoc').size().plot(kind='pie',
                                                    #autopct='%.2f',
                                                    ax=ax,
                                                    title='Winner location distribution',
                                                    rotatelabels=False,
                                                    cmap='tab10')
plt.show()

Winners were the home team far more often than the visiting team, which points to a strong home-court advantage.
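A pie chart makes the imbalance visible but not easy to quantify. A minimal sketch of how the shares could be computed is shown below on a made-up `WLoc` column (the eight values are invented). Excluding neutral-site games gives the fraction of non-neutral games won by the home team, which is the number that speaks most directly to home-court advantage.

```python
import pandas as pd

# Toy WLoc values: H = winner was at home, A = winner was away, N = neutral court.
toy = pd.DataFrame({"WLoc": ["H", "H", "H", "A", "N", "H", "A", "N"]})

# Share of games by winner location.
shares = toy["WLoc"].value_counts(normalize=True)

# Among non-neutral games, how often did the home team win?
non_neutral = toy.loc[toy["WLoc"] != "N", "WLoc"]
home_win_rate = (non_neutral == "H").mean()

print(shares)
print(f"Home win rate in non-neutral games: {home_win_rate:.3f}")
```

The same three lines apply unchanged to `WRegularSeasonCompactResults['WLoc']`.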

# **<span style='color:#A80808'>🏆 Submission</span>**

In [None]:
submission = pd.read_csv('../input/mens-march-mania-2022/MDataFiles_Stage1/MSampleSubmissionStage1.csv')
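# Placeholder baseline: draw an independent uniform random probability for each row
# (np.random.rand() without a size would assign one shared value to every row).
submission['Pred'] = np.random.rand(len(submission))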
submission.to_csv('submission.csv', index=False)
submission.head()

# This notebook is under construction 🏗