# EDA of March Machine Learning Mania 2022 Men’s

**About This Competition**
- A competition to predict the outcomes of the 2022 US men's college basketball tournament
- The competition is split into two stages
- The evaluation metric is LogLoss
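
Because the metric is LogLoss, a confident wrong prediction costs far more than a cautious one. The sketch below is a minimal binary log loss on made-up outcomes and predictions; the function name and the `eps` clipping value are illustrative assumptions, not the competition's official scoring code.

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log loss; eps clipping is an illustrative choice to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Hypothetical outcomes (1 = team1 won) and predicted probabilities
y_true = [1, 0, 1]
baseline = log_loss(y_true, [0.5, 0.5, 0.5])          # always predicting 0.5
confident_wrong = log_loss(y_true, [0.5, 0.99, 0.5])  # one confident miss
print(baseline, confident_wrong)
```

Predicting 0.5 everywhere gives a loss of ln(2), which is a handy reference point when judging a model.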

**Stage 1: Predicting past data [model-building phase]**
- Predict the winner for every possible pairing of teams that took part in the NCAA tournament in each of the last 5 years (2,278 games x 5 years)
- The inference data (listing the pairings) is available from the start of the competition
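
The 2,278 figure is what you get when every pair of teams in the field is counted once. The quick check below assumes a 68-team field, which this notebook does not state explicitly.

```python
from math import comb

n_teams = 68     # assumed tournament field size (not stated in this notebook)
n_seasons = 5    # Stage 1 covers the last 5 tournaments

games_per_season = comb(n_teams, 2)           # every unordered pair of teams
total_rows = games_per_season * n_seasons
print(games_per_season, total_rows)
```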

**Stage 2: Predicting this year's data [final phase]**
- Predict the winner for every possible pairing of teams participating in the 2022 NCAA Tournament (2,278 games)
- The inference data (listing the pairings) will be released on Monday, 3/14
- Deadline for submitting predictions: Thursday, 3/17, 3 PM UTC
- The tournament runs from 3/15 to 4/4 (the competition ranking will presumably be finalized after 4/4)

[reference:【日本語】EDA of March Machine Learning Mania 2022 Men’](https://www.kaggle.com/kazuya99986/eda-of-march-machine-learning-mania-2022-men)  
[reference2:🏀🏀 2022 March Mania - Quick EDA & FE🏀🏀](https://www.kaggle.com/kalilurrahman/2022-march-mania-quick-eda-fe)

In [None]:
# ============================
# Import Libraries
# ============================
# Fundamentals
import os
from pathlib import Path
from glob import glob
from tqdm.notebook import tqdm

import pandas as pd
import numpy as np
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
# Visualize
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams.update({'font.size': 18})
plt.style.use('fivethirtyeight')

PATH = '../input/mens-march-mania-2022/'

# EDA

### Data Section 1 - The Basics
- This section provides everything you need to build a simple prediction model and submit predictions.
<br></br>
- Team ID's and Team Names
- Tournament seeds since 1984-85 season
- Final scores of all regular season, conference tournament, and NCAA® tournament games since 1984-85 season
- Season-level details including dates and region names
- Example submission file for stage 1

**MSampleSubmissionStage1.csv**
- This file illustrates the submission file format for Stage 1. 

In [None]:
submission = pd.read_csv(PATH + 'MDataFiles_Stage1/MSampleSubmissionStage1.csv')
print(f'submission.shape: {submission.shape}')
display(submission.head())

There are 11,390 matchups to predict.  
- ID: season_team1(ID)_team2(ID)  
- Pred: predicted probability that team1 beats team2

In [None]:
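# Split the ID into its parts. Team1/Team2 are the two sides of the matchup, not a known winner and loser.
submission[['year', 'Team1', 'Team2']] = pd.DataFrame(submission['ID'].str.split('_').values.tolist(), index=submission.index)
submission['year'] = pd.to_numeric(submission.year)
submission.head()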

**MTeams.csv**

In [None]:
teams = pd.read_csv(PATH + 'MDataFiles_Stage1/MTeams.csv')
print(f'teams.shape: {teams.shape}')
display(teams.head())

- FirstD1Season: the first season the team was in Division I
- LastD1Season: the last season the team was in Division I

In [None]:
# Count unique TeamIDs; this matches the row count, so TeamID has no duplicates
len(teams.TeamID.unique())

In [None]:
teams[teams.LastD1Season==2022].reset_index(drop=True).tail()

In [None]:
# Visualize
fig = plt.figure(figsize=(10,5))
ax = fig.gca()
teams.hist(ax=ax)
plt.tight_layout()

Most of the teams have been in Division I since 1985.

In [None]:
# Seasons 1985 through 2022 (np.arange excludes the stop value)
yr_count = pd.DataFrame({'year': np.arange(1985, 2023)})

for year in yr_count.year:
    teams['is_in'] = 0
    teams.loc[(teams.FirstD1Season <= year) & (teams.LastD1Season >= year), 'is_in'] = 1
    tot_teams = teams.is_in.sum()
    yr_count.loc[yr_count.year == year, 'n_teams'] = tot_teams
    
yr_count = yr_count.set_index('year')
yr_count.n_teams.plot(figsize=(12,4))
plt.title('Number of teams per year', fontsize=16)
plt.show()

**MSeasons.csv**

In [None]:
seasons = pd.read_csv(PATH + 'MDataFiles_Stage1/MSeasons.csv')
print(f'seasons.shape: {seasons.shape}')
display(seasons.head())
display(seasons.tail())

In [None]:
seasons.RegionW.value_counts()

In [None]:
seasons.RegionX.value_counts()

In [None]:
seasons.RegionY.value_counts()

In [None]:
seasons.RegionZ.value_counts()

The regions for the 2022 season are still TBD.

**MNCAATourneySeeds.csv**

In [None]:
ncaa_seed = pd.read_csv(PATH + 'MDataFiles_Stage1/MNCAATourneySeeds.csv')
print(f'ncaa_seed.shape: {ncaa_seed.shape}')
display(ncaa_seed.head())

In [None]:
display(ncaa_seed.groupby('Season').count().head())
display(ncaa_seed.groupby('Season').count().tail())

**MRegularSeasonCompactResults.csv**
- This file identifies the game-by-game results for many seasons of historical data, starting with the 1985 season (the first year the NCAA® had a 64-team tournament). 

In [None]:
regular_compact_result = pd.read_csv(PATH + 'MDataFiles_Stage1/MRegularSeasonCompactResults.csv')
print(f'regular_compact_result.shape: {regular_compact_result.shape}')
display(regular_compact_result.head())

In [None]:
regular_compact_result["Point difference"] = regular_compact_result["WScore"] - regular_compact_result["LScore"]

In [None]:
display(regular_compact_result.describe())

In [None]:
display(regular_compact_result.groupby('Season').count().head())
display(regular_compact_result.groupby('Season').count().tail())

In [None]:
# Visualize
fig = plt.figure(figsize=(10,5))
ax = fig.gca()
regular_compact_result.hist(ax=ax)
plt.tight_layout()

In [None]:
tmp = regular_compact_result.sample(n=1000)
columns = ['Season', 'DayNum','WScore','LScore', 'Point difference',]

sns.pairplot(tmp[columns], diag_kind = 'kde',
             plot_kws = {'alpha': 0.3, 's': 20, 'edgecolor': 'k'},
             height = 2)

In [None]:
summaries = regular_compact_result[['Season', 'WScore', 'LScore', 'NumOT', 'Point difference']].groupby('Season').agg(['min', 'max', 'mean', 'median'])

summaries.columns = ['_'.join(col).strip() for col in summaries.columns.values]
summaries

In [None]:
summaries[[col for col in summaries.columns if 'Point difference' in col and 'sum' not in col]].plot(figsize=(12,4))
plt.title('Point difference over time', fontsize=16)
plt.show()

**MNCAATourneyCompactResults.csv**
- This file identifies the game-by-game NCAA® tournament results for all seasons of historical data. 
<br></br>
- DayNum=134 or 135 (Tue/Wed) - play-in games to get the tournament field down to the final 64 teams
- DayNum=136 or 137 (Thu/Fri) - Round 1, 64 teams to 32 teams
- DayNum=138 or 139 (Sat/Sun) - Round 2, 32 teams to 16 teams
- DayNum=143 or 144 (Thu/Fri) - Round 3, "Sweet Sixteen"
- DayNum=145 or 146 (Sat/Sun) - Round 4, "Elite Eight" or "regional finals"
- DayNum=152 (Sat) - Round 5, "Final Four" or "national semifinals"
- DayNum=154 (Mon) - Round 6, "national final" or "national championship"
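
The DayNum list above can be turned into a lookup table so that each tournament game gets a round label. This is a sketch only: the dictionary and function names are hypothetical, and a season with an unusual schedule may contain DayNums that are not in the list.

```python
# Mapping copied from the DayNum list above; names are hypothetical.
DAYNUM_TO_ROUND = {
    134: 0, 135: 0,   # play-in games
    136: 1, 137: 1,   # Round 1
    138: 2, 139: 2,   # Round 2
    143: 3, 144: 3,   # Sweet Sixteen
    145: 4, 146: 4,   # Elite Eight
    152: 5,           # Final Four
    154: 6,           # national final
}

def daynum_to_round(daynum):
    """Return the tournament round (0 = play-in), or None for an unlisted DayNum."""
    return DAYNUM_TO_ROUND.get(daynum)

print(daynum_to_round(136), daynum_to_round(154), daynum_to_round(140))
```

On a results DataFrame the same table could be applied with `df['DayNum'].map(DAYNUM_TO_ROUND)`; unlisted days would come out as NaN and are worth inspecting by hand.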

In [None]:
ncaa_compact_result = pd.read_csv(PATH + 'MDataFiles_Stage1/MNCAATourneyCompactResults.csv')
print(f'ncaa_compact_result.shape: {ncaa_compact_result.shape}')
display(ncaa_compact_result.head())

In [None]:
ncaa_compact_result["Point difference"] = ncaa_compact_result["WScore"] - ncaa_compact_result["LScore"]

In [None]:
# Visualize
fig = plt.figure(figsize=(10,5))
ax = fig.gca()
ncaa_compact_result.hist(ax=ax)
plt.tight_layout()

In [None]:
tmp = ncaa_compact_result.sample(n=1000)
columns = ['Season', 'DayNum','WScore','LScore', 'Point difference',]

sns.pairplot(tmp[columns], diag_kind = 'kde',
             plot_kws = {'alpha': 0.3, 's': 20, 'edgecolor': 'k'},
             height = 2)

In [None]:
summaries = ncaa_compact_result[['Season', 'WScore', 'LScore', 'NumOT', 'Point difference']].groupby('Season').agg(['min', 'max', 'mean', 'median'])

summaries.columns = ['_'.join(col).strip() for col in summaries.columns.values]
summaries

In [None]:
summaries[[col for col in summaries.columns if 'Point difference' in col and 'sum' not in col]].plot(figsize=(12,4))
plt.title('Point difference over time', fontsize=16)
plt.show()

### Data Section 2 - Team Box Scores

This section provides game-by-game stats at a team level (free throws attempted, defensive rebounds, turnovers, etc.)  
for all regular season, conference tournament, and NCAA® tournament games since the 2002-03 season.

- WFGM - field goals made (by the winning team)
- WFGA - field goals attempted (by the winning team)
- WFGM3 - three pointers made (by the winning team)
- WFGA3 - three pointers attempted (by the winning team)
- WFTM - free throws made (by the winning team)
- WFTA - free throws attempted (by the winning team)
- WOR - offensive rebounds (pulled by the winning team)
- WDR - defensive rebounds (pulled by the winning team)
- WAst - assists (by the winning team)
- WTO - turnovers committed (by the winning team)
- WStl - steals (accomplished by the winning team)
- WBlk - blocks (accomplished by the winning team)
- WPF - personal fouls committed (by the winning team)
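
The made/attempted pairs above lend themselves to simple derived features such as shooting percentages. The sketch below uses invented numbers and hypothetical new column names (`WFG_pct`, `WFG3_pct`, `WFT_pct`, `W_points`). It assumes that WFGM counts both two- and three-point field goals, so the score can be rebuilt from the box score.

```python
import pandas as pd

# Hypothetical box-score rows using the column names listed above
box = pd.DataFrame({
    'WFGM': [25, 30], 'WFGA': [50, 60],
    'WFGM3': [5, 9],  'WFGA3': [20, 18],
    'WFTM': [15, 8],  'WFTA': [20, 10],
})

# Shooting percentages: made / attempted
box['WFG_pct'] = box['WFGM'] / box['WFGA']
box['WFG3_pct'] = box['WFGM3'] / box['WFGA3']
box['WFT_pct'] = box['WFTM'] / box['WFTA']

# Points implied by the box score: 2 per field goal, +1 per three, 1 per free throw
box['W_points'] = 2 * box['WFGM'] + box['WFGM3'] + box['WFTM']
print(box[['WFG_pct', 'WFG3_pct', 'WFT_pct', 'W_points']])
```

On real data, games with zero attempts in a category would produce NaN or inf in the ratio, so those rows need separate handling.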

**MRegularSeasonDetailedResults.csv**
- This file provides team-level box scores for many regular seasons of historical data, starting with the 2003 season.  
Every game listed in the MRegularSeasonCompactResults file since the 2003 season  
should also appear, exactly once, in the MRegularSeasonDetailedResults file.
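
One way to verify that claim is a left merge with an indicator column. The sketch below runs on tiny invented tables and assumes that a game is uniquely identified by (Season, DayNum, WTeamID, LTeamID).

```python
import pandas as pd

# Assumed game key; not confirmed by the notebook itself
keys = ['Season', 'DayNum', 'WTeamID', 'LTeamID']

# Hypothetical miniature versions of the compact and detailed files
compact = pd.DataFrame({
    'Season': [2002, 2003, 2003],
    'DayNum': [10, 10, 11],
    'WTeamID': [1101, 1104, 1272],
    'LTeamID': [1102, 1328, 1393],
})
detailed = pd.DataFrame({
    'Season': [2003, 2003],
    'DayNum': [10, 11],
    'WTeamID': [1104, 1272],
    'LTeamID': [1328, 1393],
    'WFGM': [27, 26],
})

# Keep compact games from 2003 onward and look for any without a detailed row
recent = compact[compact['Season'] >= 2003]
merged = recent.merge(detailed[keys], on=keys, how='left', indicator=True)
missing = merged[merged['_merge'] == 'left_only']
print(len(recent), len(missing))
```

An empty `missing` frame means every recent compact game has a detailed counterpart.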

In [None]:
regular_detail_result = pd.read_csv(PATH + 'MDataFiles_Stage1/MRegularSeasonDetailedResults.csv')
print(f'regular_detail_result.shape: {regular_detail_result.shape}')
display(regular_detail_result.head())

In [None]:
display(regular_detail_result.describe())

This data covers the 2003 season onward only.

In [None]:
regular_detail_result.WLoc.unique()
# N = neutral site, H = home, A = away (location of the winning team)

In [None]:
tmp = regular_detail_result.sample(n=1000)
columns = ['WLoc', 'WScore', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3' ]
sns.pairplot(tmp[columns], diag_kind = 'kde', hue='WLoc',
             plot_kws = {'alpha': 0.2, 's': 20, 'edgecolor': 'k'},
             height = 2)

In [None]:
columns = ['WLoc', 'WFTM', 'WFTA', 'WOR', 'WDR', ]
sns.pairplot(tmp[columns], diag_kind = 'kde', hue='WLoc',
             plot_kws = {'alpha': 0.2, 's': 20, 'edgecolor': 'k'},
             height = 2)

In [None]:
columns = ['WLoc', 'WTO', 'WStl', 'WBlk', 'WPF', ]
sns.pairplot(tmp[columns], diag_kind = 'kde', hue='WLoc',
             plot_kws = {'alpha': 0.2, 's': 20, 'edgecolor': 'k'},
             height = 2)

**MNCAATourneyDetailedResults.csv**
- This file provides team-level box scores for many NCAA® tournaments, starting with the 2003 season.  
Every game listed in the MNCAATourneyCompactResults file since the 2003 season  
should also appear, exactly once, in the MNCAATourneyDetailedResults file.

In [None]:
tourney_detail_result = pd.read_csv(PATH + 'MDataFiles_Stage1/MNCAATourneyDetailedResults.csv')
print(f'tourney_detail_result.shape: {tourney_detail_result.shape}')
display(tourney_detail_result.head())

In [None]:
display(tourney_detail_result.describe())

This data also covers the 2003 season onward only.

In [None]:
tmp = tourney_detail_result.sample(n=1000)
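# Scoring and shooting columns first (the turnover/steal/block/foul columns are plotted further below)
columns = ['WLoc', 'WScore', 'WFGM', 'WFGA', 'WFGM3', 'WFGA3' ]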
sns.pairplot(tmp[columns], diag_kind = 'kde', hue='WLoc',
             plot_kws = {'alpha': 0.2, 's': 20, 'edgecolor': 'k'},
             height = 2)

In [None]:
columns = ['WLoc', 'WFTM', 'WFTA', 'WOR', 'WDR', ]
sns.pairplot(tmp[columns], diag_kind = 'kde', hue='WLoc',
             plot_kws = {'alpha': 0.2, 's': 20, 'edgecolor': 'k'},
             height = 2)

In [None]:
columns = ['WLoc', 'WTO', 'WStl', 'WBlk', 'WPF', ]
sns.pairplot(tmp[columns], diag_kind = 'kde', hue='WLoc',
             plot_kws = {'alpha': 0.2, 's': 20, 'edgecolor': 'k'},
             height = 2)

### Data Section 3 - Geography
This section provides city locations of all regular season, conference tournament, and NCAA® tournament games  
since the 2009-10 season.

**Cities.csv**  
- This file provides a master list of cities that have been locations for games played.

In [None]:
cities = pd.read_csv(PATH + 'MDataFiles_Stage1/Cities.csv')
print(f'cities.shape: {cities.shape}')
display(cities.head())

In [None]:
len(cities.CityID.unique())

**MGameCities.csv**
- This file identifies all games, starting with the 2010 season, along with the city that the game was played in. 

In [None]:
m_game_cities = pd.read_csv(PATH + 'MDataFiles_Stage1/MGameCities.csv')
print(f'm_game_cities.shape: {m_game_cities.shape}')
display(m_game_cities.head())

- CRType - this can be either Regular or NCAA or Secondary. 

### Data Section 4 - Public Rankings
This section provides weekly team rankings for dozens of top rating systems (Pomeroy, Sagarin, RPI, ESPN, etc.)  
since the 2002-2003 season.

**MMasseyOrdinals.csv**
- This file lists out rankings (e.g. #1, #2, #3, ..., #N) of teams going back to the 2002-2003 season,  
under a large number of different ranking system methodologies. 

In [None]:
m_massy_ordinals = pd.read_csv(PATH + 'MDataFiles_Stage1/MMasseyOrdinals.csv')
print(f'm_massy_ordinals.shape: {m_massy_ordinals.shape}')
display(m_massy_ordinals.head())
display(m_massy_ordinals.tail())

- RankingDayNum - from 0 to 133, in the same terms as a game's DayNum (where DayZero is found in the MSeasons.csv file).  
- SystemName - this is the (usually) 3-letter abbreviation for each distinct ranking system. 
- OrdinalRank - this is the overall ranking of the team in the underlying system.  
Ranks typically run from #1 through #351, but in recent seasons they go higher because additional teams joined Division I.
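
Since each system publishes rankings on many days, a common first step is to keep only each team's latest ranking per system and season. The sketch below does that on invented rows; the choice of "latest available RankingDayNum" as the pre-tournament ranking is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows in the MMasseyOrdinals layout
ordinals = pd.DataFrame({
    'Season':        [2021, 2021, 2021, 2021],
    'RankingDayNum': [100, 133, 100, 128],
    'SystemName':    ['POM', 'POM', 'POM', 'POM'],
    'TeamID':        [1101, 1101, 1102, 1102],
    'OrdinalRank':   [40, 35, 200, 210],
})

# For each (Season, SystemName, TeamID), keep the row with the largest RankingDayNum
group_cols = ['Season', 'SystemName', 'TeamID']
last_idx = ordinals.groupby(group_cols)['RankingDayNum'].idxmax()
final_ranks = ordinals.loc[last_idx].reset_index(drop=True)
print(final_ranks)
```

Note that different teams may have their last ranking on different days, as in the toy data here.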

In [None]:
len(m_massy_ordinals["SystemName"].unique())

In [None]:
display(m_massy_ordinals.groupby("Season")["OrdinalRank"].max())

### Data Section 5 - Supplements
- This section contains additional supporting information, including coaches, conference affiliations, alternative team name spellings, bracket structure, and game results for NIT and other postseason tournaments.

**MTeamCoaches.csv**

This file indicates the head coach for each team in each season, including a start/finish range of DayNum's to indicate a mid-season coaching change.
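
The DayNum range makes it possible to look up who was coaching a team on a particular game day. Below is a minimal sketch on invented rows; the column names `FirstDayNum`, `LastDayNum`, and `CoachName` follow the competition's data description, and the helper `coach_on_day` is hypothetical.

```python
import pandas as pd

# Hypothetical rows: team 1101 changes coach mid-season after DayNum 60
coaches = pd.DataFrame({
    'Season':      [2020, 2020, 2020],
    'TeamID':      [1101, 1101, 1102],
    'FirstDayNum': [0, 61, 0],
    'LastDayNum':  [60, 154, 154],
    'CoachName':   ['coach_a', 'coach_b', 'coach_c'],
})

def coach_on_day(df, season, team_id, daynum):
    """Return the coach whose [FirstDayNum, LastDayNum] range contains daynum, or None."""
    mask = (
        (df['Season'] == season) & (df['TeamID'] == team_id)
        & (df['FirstDayNum'] <= daynum) & (df['LastDayNum'] >= daynum)
    )
    names = df.loc[mask, 'CoachName']
    return names.iloc[0] if not names.empty else None

print(coach_on_day(coaches, 2020, 1101, 30), coach_on_day(coaches, 2020, 1101, 100))
```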

In [None]:
m_team_coaches = pd.read_csv(PATH + 'MDataFiles_Stage1/MTeamCoaches.csv')
print(f'm_team_coaches.shape: {m_team_coaches.shape}')
display(m_team_coaches.head())
display(m_team_coaches.tail())

In [None]:
m_team_coaches[m_team_coaches["FirstDayNum"] > 0].count()

In [None]:
len(m_team_coaches["CoachName"].unique())

**Conferences.csv**
- This file indicates the Division I conferences that have existed over the years since 1985. 

In [None]:
conferences = pd.read_csv(PATH + 'MDataFiles_Stage1/Conferences.csv')
print(f'conferences.shape: {conferences.shape}')
display(conferences.head())

**MTeamConferences.csv**
- This file indicates the conference affiliations for each team during each season. 

In [None]:
m_team_conferences = pd.read_csv(PATH + 'MDataFiles_Stage1/MTeamConferences.csv')
print(f'm_team_conferences.shape: {m_team_conferences.shape}')
display(m_team_conferences.head())

**MConferenceTourneyGames.csv**
- This file indicates which games were part of each year's post-season conference tournaments  
(all of which finished on Selection Sunday or earlier), starting from the 2001 season.

In [None]:
m_conference_tourney_games = pd.read_csv(PATH + 'MDataFiles_Stage1/MConferenceTourneyGames.csv')
print(f'm_conference_tourney_games.shape: {m_conference_tourney_games.shape}')
display(m_conference_tourney_games.head())

**MSecondaryTourneyTeams.csv**
- This file identifies the teams that participated in post-season tournaments other than the NCAA® Tournament  
(such events would run in parallel with the NCAA® Tournament). 

In [None]:
m_secondary_tourney_teams = pd.read_csv(PATH + 'MDataFiles_Stage1/MSecondaryTourneyTeams.csv')
print(f'm_secondary_tourney_teams.shape: {m_secondary_tourney_teams.shape}')
display(m_secondary_tourney_teams.head())

- SecondaryTourney - this is the abbreviation of the tournament, either NIT, CBI, CIT, or V16 (which stands for Vegas 16).

**MSecondaryTourneyCompactResults.csv**
- This file indicates the final scores for the tournament games of "secondary" post-season tournaments: the NIT, CBI, CIT, and Vegas 16.

In [None]:
m_secondary_tourney_compact_results = pd.read_csv(PATH + 'MDataFiles_Stage1/MSecondaryTourneyCompactResults.csv')
print(f'm_secondary_tourney_compact_results.shape: {m_secondary_tourney_compact_results.shape}')
display(m_secondary_tourney_compact_results.head())

**MTeamSpellings.csv**
- This file indicates alternative spellings of many team names.  
It is intended for use in associating external spellings against our own TeamID numbers,  
thereby helping to relate the external data properly with our datasets. 

In [None]:
m_team_spellings = pd.read_csv(PATH + 'MDataFiles_Stage1/MTeamSpellings.csv', encoding='cp932')
print(f'm_team_spellings.shape: {m_team_spellings.shape}')
display(m_team_spellings.head())

- TeamNameSpelling - this is the spelling of the team name.  
It is always expressed in all lowercase letters  
e.g. "ball state" rather than "Ball State" - in order to emphasize that any comparisons should be case-insensitive when matching.
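
In practice this means lowercasing external team names before looking them up. The sketch below uses invented rows and made-up TeamID values; the helper name `team_id_for` is hypothetical.

```python
import pandas as pd

# Hypothetical rows in the MTeamSpellings layout (TeamID values are made up)
spellings = pd.DataFrame({
    'TeamNameSpelling': ['ball state', 'ball st', 'duke'],
    'TeamID': [9001, 9001, 9002],
})
lookup = dict(zip(spellings['TeamNameSpelling'], spellings['TeamID']))

def team_id_for(name):
    """Normalize an external name to lowercase and look it up; None if unknown."""
    return lookup.get(name.strip().lower())

print(team_id_for('Ball State'), team_id_for(' DUKE '), team_id_for('Unknown U'))
```

Names that come back as None are candidates for manual mapping.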

**MNCAATourneySlots.csv**
- This file identifies the mechanism by which teams are paired against each other, depending upon their seeds, as the tournament proceeds through its rounds. 
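
To illustrate the pairing mechanism, the sketch below joins slot rows to seed rows to recover first-round matchups. All rows are invented; the column names `Slot`, `StrongSeed`, `WeakSeed`, `Seed`, and `TeamID` are assumed to follow the competition's data description.

```python
import pandas as pd

# Hypothetical slot rows: each slot names the two seeds that meet in it
slots = pd.DataFrame({
    'Season': [2021, 2021],
    'Slot': ['R1W1', 'R1W2'],
    'StrongSeed': ['W01', 'W02'],
    'WeakSeed': ['W16', 'W15'],
})
# Hypothetical seed rows (TeamID values are made up)
seeds = pd.DataFrame({
    'Season': [2021] * 4,
    'Seed': ['W01', 'W02', 'W15', 'W16'],
    'TeamID': [9001, 9002, 9015, 9016],
})

# Resolve each seed label to a TeamID, once for the strong side and once for the weak side
strong = seeds.rename(columns={'Seed': 'StrongSeed', 'TeamID': 'StrongTeamID'})
weak = seeds.rename(columns={'Seed': 'WeakSeed', 'TeamID': 'WeakTeamID'})
matchups = (slots.merge(strong, on=['Season', 'StrongSeed'], how='left')
                 .merge(weak, on=['Season', 'WeakSeed'], how='left'))
print(matchups[['Slot', 'StrongTeamID', 'WeakTeamID']])
```

Slots for later rounds refer to the winners of earlier slots rather than to seeds, so a plain join like this would leave them unresolved.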

In [None]:
mmcaa_tourney_slots = pd.read_csv(PATH + 'MDataFiles_Stage1/MNCAATourneySlots.csv')
print(f'mmcaa_tourney_slots.shape: {mmcaa_tourney_slots.shape}')
display(mmcaa_tourney_slots.head())

**MNCAATourneySeedRoundSlots.csv**
- This file helps to represent the bracket structure in any given year. 

In [None]:
mmcaa_tourney_seed_round_slots = pd.read_csv(PATH + 'MDataFiles_Stage1/MNCAATourneySeedRoundSlots.csv')
print(f'mmcaa_tourney_seed_round_slots.shape: {mmcaa_tourney_seed_round_slots.shape}')
display(mmcaa_tourney_seed_round_slots.head())

- Seed - the tournament seed of the team
- GameRound - the round during the tournament  
Round 0 (zero) is for the play-in games, Rounds 1/2 are for the first weekend, Rounds 3/4 are for the second weekend, and Rounds 5/6 are the national semifinals and finals.
- GameSlot - the game slot that the team would be playing in during the given GameRound.  
The naming convention for slots is given in the competition's description of the MNCAATourneySlots file.
- EarlyDayNum, LateDayNum - the earliest and latest possible DayNums on which the game might be played.