In [1]:
import pandas as pd

# Data Cleaning
In this notebook we load each data file, inspect its contents, and decide what we could potentially do with it.

First, let's list the files we uploaded to the data directory.

In [2]:
%ls /data/dtumer

2016DetailedSeasonResults.csv     TourneyCompactResults.csv
RegularSeasonCompactResults.csv   TourneyDetailedResults.csv
RegularSeasonDetailedResults.csv  TourneySeeds.csv
Seasons.csv                       TourneySlots.csv
Teams.csv


Next, we store the data directory path in a prefix variable so the file paths below stay short.

In [3]:
dat_loc_prefix = "/data/dtumer/"
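
As a minimal sketch, the prefix could also be combined with a file name through `os.path.join`, which does not depend on the trailing slash being present. The helper name `data_path` is hypothetical and is not used elsewhere in this notebook.

```python
import os

dat_loc_prefix = "/data/dtumer/"


def data_path(filename, prefix=dat_loc_prefix):
    """Build the full path to a data file (hypothetical helper)."""
    return os.path.join(prefix, filename)


teams_path = data_path("Teams.csv")
print(teams_path)
```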

## Teams
In this section we load the teams data and look at the first few rows to see which columns are present.

In [4]:
teams = pd.read_csv(dat_loc_prefix + "Teams.csv")
teams.set_index("Team_Id", inplace=True)

In [5]:
teams.head()

Team_Id,Team_Name
1101,Abilene Chr
1102,Air Force
1103,Akron
1104,Alabama
1105,Alabama A&M


## Seasons
In this section we load the seasons data and look at which columns we have to work with. This file may not be used later, since it describes the NCAA tournament rather than game matchups.

In [6]:
seasons = pd.read_csv(dat_loc_prefix + "Seasons.csv")

In [7]:
seasons.head()

,Season,Dayzero,Regionw,Regionx,Regiony,Regionz
0,1985,10/29/1984,East,West,Midwest,Southeast
1,1986,10/28/1985,East,Midwest,Southeast,West
2,1987,10/27/1986,East,Southeast,Midwest,West
3,1988,11/02/1987,East,Midwest,Southeast,West
4,1989,10/31/1988,East,West,Midwest,Southeast


## Regular Season Results
In this section we explore the legacy file for the seasons 1984-2015. Besides the teams, location, and overtime count, this data contains only the winning and losing scores, so it is most likely not useful in our solution.

In [8]:
r_season_results_c = pd.read_csv(dat_loc_prefix + "RegularSeasonCompactResults.csv")

In [9]:
r_season_results_c.head()

,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,20,1228,81,1328,64,N,0
1,1985,25,1106,77,1354,70,H,0
2,1985,25,1112,63,1223,56,H,0
3,1985,25,1165,70,1432,54,H,0
4,1985,25,1192,86,1447,74,H,0


The following file, on the other hand, is extremely important: it will be the basis for all of our predictions. It contains the per-game statistics we need to build an effective machine learning model for game matchups.

In [10]:
r_season_results_d = pd.read_csv(dat_loc_prefix + "RegularSeasonDetailedResults.csv")

In [11]:
r_season_results_d.head()

,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Lfga3,Lftm,Lfta,Lor,Ldr,Last,Lto,Lstl,Lblk,Lpf
0,2003,10,1104,68,1328,62,N,0,27,58,...,10,16,22,10,22,8,18,9,2,20
1,2003,10,1272,70,1393,63,N,0,26,62,...,24,9,20,20,25,7,12,8,6,16
2,2003,11,1266,73,1437,61,N,0,24,58,...,26,14,23,31,22,9,12,2,5,23
3,2003,11,1296,56,1457,50,N,0,18,38,...,22,8,15,17,20,9,19,4,3,23
4,2003,11,1400,77,1208,71,N,0,30,61,...,16,17,27,21,15,12,10,7,1,14
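
To make the difference between the compact and detailed files concrete, the sketch below compares their column lists. It uses empty stand-in frames rather than the real files, and the detailed column list is an assumption limited to the columns visible in the truncated output above (the ones hidden behind `...` are left out).

```python
import pandas as pd

# Columns of the compact file, as shown in its head() output.
compact_cols = ["Season", "Daynum", "Wteam", "Wscore",
                "Lteam", "Lscore", "Wloc", "Numot"]

# Only the detailed columns visible in the truncated output; the real file has more.
detailed_cols = compact_cols + ["Wfgm", "Wfga", "Lfga3", "Lftm", "Lfta", "Lor",
                                "Ldr", "Last", "Lto", "Lstl", "Lblk", "Lpf"]

# Empty stand-ins for r_season_results_c and r_season_results_d.
compact = pd.DataFrame(columns=compact_cols)
detailed = pd.DataFrame(columns=detailed_cols)

# Columns the detailed file adds on top of the compact one, in file order.
extra_cols = [c for c in detailed.columns if c not in compact.columns]

# Check that every compact column also appears in the detailed file.
compact_is_subset = all(c in detailed.columns for c in compact.columns)

print(extra_cols)
print(compact_is_subset)
```

On the real frames, the same two expressions can be run against `r_season_results_d.columns` and `r_season_results_c.columns`.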


## Tourney Season Results
This section covers loading the tournament game-by-game results. These files are only needed to supplement the regular season detailed results file. Again, the compact file with games from 1984-2015 will not be used, because it lacks the columns we want to analyze.

In [12]:
tourney_results_c = pd.read_csv(dat_loc_prefix + "TourneyCompactResults.csv")

In [13]:
tourney_results_c.head()

,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot
0,1985,136,1116,63,1234,54,N,0
1,1985,136,1120,59,1345,58,N,0
2,1985,136,1207,68,1250,43,N,0
3,1985,136,1229,58,1425,55,N,0
4,1985,136,1242,49,1325,38,N,0


The following file will be useful for supplementing the regular season results with additional game details for teams that played in the tournament.

In [14]:
tourney_results_d = pd.read_csv(dat_loc_prefix + "TourneyDetailedResults.csv")

In [15]:
tourney_results_d.head()

,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Lfga3,Lftm,Lfta,Lor,Ldr,Last,Lto,Lstl,Lblk,Lpf
0,2003,134,1421,92,1411,84,N,1,32,69,...,31,14,31,17,28,16,15,5,0,22
1,2003,136,1112,80,1436,51,N,0,31,66,...,16,7,7,8,26,12,17,10,3,15
2,2003,136,1113,84,1272,71,N,0,31,59,...,28,14,21,20,22,11,12,2,5,18
3,2003,136,1141,79,1166,73,N,0,29,53,...,17,12,17,14,17,20,21,6,6,21
4,2003,136,1143,76,1301,74,N,1,27,64,...,21,15,20,10,26,16,14,5,8,19
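
One way the tournament games could supplement the regular season games is a simple row-wise concatenation, sketched below. The frames are small stand-ins with only a few of the real columns, and the `Tourney` flag column is an assumption added for illustration, not something present in the source files.

```python
import pandas as pd

# Stand-ins for r_season_results_d and tourney_results_d (a few columns only).
regular = pd.DataFrame({
    "Season": [2003, 2003],
    "Daynum": [10, 11],
    "Wteam": [1104, 1266],
    "Lteam": [1328, 1437],
})
tourney = pd.DataFrame({
    "Season": [2003],
    "Daynum": [134],
    "Wteam": [1421],
    "Lteam": [1411],
})

# Stack both sets of games and mark where each row came from
# (the Tourney column is a hypothetical addition).
all_games = pd.concat(
    [regular.assign(Tourney=False), tourney.assign(Tourney=True)],
    ignore_index=True,
)

print(all_games)
```

Because both detailed files share the same columns, the concatenation needs no renaming; the flag only keeps the two sources distinguishable afterwards.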


## Tourney Team Seeds
This section covers loading the tourney seeds file, which lists each team's seed in each tournament. This file will also not be useful for us, since we are not analyzing the tournament specifically.

In [16]:
tourney_seeds = pd.read_csv(dat_loc_prefix + "TourneySeeds.csv")

In [17]:
tourney_seeds.head()

,Season,Seed,Team
0,1985,W01,1207
1,1985,W02,1210
2,1985,W03,1228
3,1985,W04,1260
4,1985,W05,1374


## Tourney Slots
This section covers loading the tourney slot matchups for the tournaments. This file will not be necessary because, again, it covers only tournament data and not regular season data.

In [18]:
tourney_slots = pd.read_csv(dat_loc_prefix + "TourneySlots.csv")

In [19]:
tourney_slots.head()

,Season,Slot,Strongseed,Weakseed
0,1985,R1W1,W01,W16
1,1985,R1W2,W02,W15
2,1985,R1W3,W03,W14
3,1985,R1W4,W04,W13
4,1985,R1W5,W05,W12


# 2016 Regular Season Results
This section covers loading the 2016 regular season game-by-game results. This file is very important, as its data will be used to predict matchups: the statistics will be averaged per team over the whole season and fed into our machine learning model to predict the outcomes of game matchups.

In [20]:
new_season = pd.read_csv(dat_loc_prefix + "2016DetailedSeasonResults.csv")
# Keep only the 2016 rows (the index below starts at 65872, so earlier rows are dropped).
new_season = new_season[new_season.Season == 2016]

In [21]:
new_season.head()

,Season,Daynum,Wteam,Wscore,Lteam,Lscore,Wloc,Numot,Wfgm,Wfga,...,Lfga3,Lftm,Lfta,Lor,Ldr,Last,Lto,Lstl,Lblk,Lpf
65872,2016,11,1104,77,1244,64,H,0,29,57,...,19,19,26,12,27,6,16,7,4,25
65873,2016,11,1105,68,1408,67,A,1,25,64,...,27,16,26,18,30,11,19,6,7,21
65874,2016,11,1112,79,1334,61,H,0,24,61,...,19,13,19,5,23,9,9,3,1,25
65875,2016,11,1115,58,1370,56,A,0,20,55,...,16,16,28,10,31,12,15,5,0,17
65876,2016,11,1116,86,1380,68,H,0,32,66,...,12,20,28,7,21,9,17,8,5,22
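
The per-team averaging described above could look roughly like the following sketch. It runs on a tiny made-up frame with only two statistics, and the helper name `team_season_averages` and its `stats` parameter are hypothetical. The idea is to split each game into a winner row and a loser row, stack them, and average per team.

```python
import pandas as pd

# Made-up stand-in for new_season with only a few of the real columns.
games = pd.DataFrame({
    "Wteam": [1104, 1105, 1104],
    "Wscore": [77, 68, 80],
    "Wfgm": [29, 25, 30],
    "Lteam": [1244, 1104, 1105],
    "Lscore": [64, 67, 70],
    "Lfgm": [22, 24, 26],
})


def team_season_averages(df, stats=("score", "fgm")):
    """Average each team's stats over all of its games, wins and losses alike."""
    stats = list(stats)

    # One row per game from the winner's point of view.
    winners = df[["Wteam"] + ["W" + s for s in stats]].copy()
    winners.columns = ["Team"] + stats

    # One row per game from the loser's point of view.
    losers = df[["Lteam"] + ["L" + s for s in stats]].copy()
    losers.columns = ["Team"] + stats

    # Stack both views so every team appears once per game played.
    per_game = pd.concat([winners, losers], ignore_index=True)
    return per_game.groupby("Team").mean()


averages = team_season_averages(games)
print(averages)
```

Listing the statistics explicitly, rather than selecting every column that starts with `W`, keeps non-numeric columns such as `Wloc` out of the average.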


# Saving The Data
In this section we save all relevant DataFrames as pandas pickles for use in other notebooks.

In [22]:
teams.to_pickle("teams")
r_season_results_d.to_pickle("past_season_detailed_results")
tourney_results_d.to_pickle("past_tourney_detailed_results")
new_season.to_pickle("new_season_detailed_results")
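
For reference, a later notebook could read these pickles back with `pd.read_pickle`. The sketch below round-trips a small stand-in frame through a temporary directory so it can run without the real files; the frame contents are copied from the teams output above, and the temporary path is an assumption for illustration only.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the teams frame, indexed by Team_Id as above.
teams_demo = pd.DataFrame(
    {"Team_Name": ["Abilene Chr", "Air Force"]},
    index=pd.Index([1101, 1102], name="Team_Id"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "teams")
    teams_demo.to_pickle(pickle_path)        # same call as in the cell above
    restored = pd.read_pickle(pickle_path)   # how another notebook would load it

print(restored)
```

The index and column types survive the round trip, which is the main advantage of pickles over writing the frames back to CSV.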