### Men's NCAA Tournament Lab

Welcome!  This lab is designed to introduce you all to building features and scoring models on game data from the NCAA tournament.  

When you're done, you should be able to work through the basics of using predictive models in these types of situations.

**Step 1:** Import files for the seeds, ncaa tournament games, and regular season games.  Also import the exported csv you made from class for the initial one variable model you fit.

In [3]:
import pandas as pd
import numpy as np
seeds = pd.read_csv('../../data/NCAA/MNCAATourneySeeds.csv')
results = pd.read_csv('../../data/NCAA/MNCAATourneyCompactResults.csv')

season = pd.read_csv('../../data/NCAA/MRegularSeasonCompactResults.csv')

game_data = pd.read_csv('../../data/NCAA/game_data.csv')

**Step 2:** Create a Training & Test Set, With the Test Set Comprising of All Games 2015 & After.  Use the exported csv from class for this, since it's already prepped.

In [38]:
train1 = game_data[game_data['Season'] < 2015]
test1 = game_data[~game_data['Season'] < 2015]

**Step 3:** Find an initial validation score with the 1 seed model, and a RandomForest Classifier, right out of the box.

 - Run KFold, using 10 splits
 - Just use the seed difference for X
 - FYI: The score being returned is prediction accuracy

In [39]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
rf = RandomForestClassifier()

X1 = train1[['SeedDiff']]
y1 = train1['Result']

cross_scores = cross_val_score(rf, X1, y1, cv=10)

What is your initial validation score?

In [37]:
np.mean(cross_scores)

0.7066872008268059

**Step 4:** Create new data that captures the won-loss record of each team

We're going to break this down into smaller steps to make it easier to digest

**a).** Use `groupby()` to group teams based on `Season` and `WTeamID` in the dataset for regular season games.  Apply the `count()` aggregator to one of the columns to determine how many games each team won.

In [12]:
wins = season.groupby(['Season', 'WTeamID'])['WScore'].count().reset_index(name='Wins')
losses = season.groupby(['Season', 'LTeamID'])['LScore'].count().reset_index(name='Losses')

**b).** Save the grouping from the previous step as it's own variable, but with the following additions:

 - tack on the `reset_index()` method at the end -- note what this does
 - as an argument for the `reset_index()` method, pass in `name=Wins`

In [6]:
# your answer here

**c).** Repeat steps `a` and `b`, but this time group in `LTeamID` and make the new column called `Losses` instead of `Wins`.

In [7]:
# your answer here

At this point -- look at the two variables you created, and just make sure you can make sense out of what they're telling you.  You should have two separate dataframes that tell you how many wins & losses each team in each season had from 1985 until tolday.

**Step 5:** Merge your data back into your original data set

This can be a little tedious and time consuming, but you have to be careful in order to make sure you get it right.

**Part 1:** Building Features for Team 1

**a).** How many games did team 1 win?

Do the following merge:

 - **left dataset:**  the exported csv file from class
 - **right dataset:** the data with each team's losses
 - **merge type:** left
 - **left columns to join:** `'Season'`, `'T1TeamID'`
 - **right columns to join:** `'Season'`, `'WTeamID'`
 - **new column name:** `'T1Wins'`

In [15]:
game_data = game_data.merge(wins, how='left', left_on=['Season', 'T1TeamID'], right_on=['Season', 'WTeamID'])
game_data = game_data.merge(losses, how='left', left_on=['Season', 'T1TeamID'], right_on=['Season', 'LTeamID'])
game_data.head()

Unnamed: 0,Season,T1TeamID,T1Score,T2TeamID,T2Score,Result,T1Seed,T2Seed,SeedDiff,WTeamID,Wins,LTeamID,Losses
0,1985,1116,63,1234,54,1,9,8,1,1116,21,1116.0,12.0
1,1985,1120,59,1345,58,1,11,6,5,1120,18,1120.0,11.0
2,1985,1207,68,1250,43,1,1,16,-15,1207,25,1207.0,2.0
3,1985,1229,58,1425,55,1,9,8,1,1229,20,1229.0,7.0
4,1985,1242,49,1325,38,1,3,14,-11,1242,23,1242.0,7.0


**b).** How many games did team 1 lose?

Do the following merge:

 - **left dataset:**  the exported csv file from class
 - **right dataset:** the data with each team's losses
 - **merge type:** left
 - **left columns to join:** `'Season'`, `'T1TeamID'`
 - **right columns to join:** `'Season'`, `'LTeamID'`
 - **new column name:** `'T1Losses'`

In [9]:
# your answer here

**c).** Some teams have gone undefeated.  If that's the case there will be no entries for them in the loss column.  Fill in these values with 0 now.

In [17]:
game_data['Losses'].fillna(0, inplace=True)

**d).** You probably have some unnecessary columns right now.  Remove unnecessary columns created from the merges if they exist.  These are most likely going to be the `WTeamID` and `LTeamID` columns.

In [18]:
game_data.drop(['WTeamID', 'LTeamID'], axis=1, inplace=True)

**e).** Now create a new column called `T1WinPCT` that's the winning percentage of team 1.

In [19]:
game_data['T1WinPCT'] = game_data['Wins'] / (game_data['Wins'] + game_data['Losses'])

**Part II:**  Build the same features for Team II

Your turn:  Try and recreate the exact same features you just created for the first team, but for the second.

**Hint:**  In your original dataset, swap out `T1TeamID` for `T2TeamID` for the merges.

In [20]:
game_data = game_data.merge(wins, how='left', left_on=['Season', 'T2TeamID'], right_on=['Season', 'WTeamID'])
game_data = game_data.merge(losses, how='left', left_on=['Season', 'T2TeamID'], right_on=['Season', 'LTeamID'])
game_data.head()

Unnamed: 0,Season,T1TeamID,T1Score,T2TeamID,T2Score,Result,T1Seed,T2Seed,SeedDiff,Wins_x,Losses_x,T1WinPCT,WTeamID,Wins_y,LTeamID,Losses_y
0,1985,1116,63,1234,54,1,9,8,1,21,12.0,0.636364,1234,20,1234.0,10.0
1,1985,1120,59,1345,58,1,11,6,5,18,11.0,0.62069,1345,17,1345.0,8.0
2,1985,1207,68,1250,43,1,1,16,-15,25,2.0,0.925926,1250,11,1250.0,18.0
3,1985,1229,58,1425,55,1,9,8,1,20,7.0,0.740741,1425,19,1425.0,9.0
4,1985,1242,49,1325,38,1,3,14,-11,23,7.0,0.766667,1325,20,1325.0,7.0


In [21]:
game_data.rename({'Wins_x': 'T1Wins', 
                  'Losses_x': 'T1Losses',
                 'Wins_y': 'T2Wins',
                  'Losses_y': 'T2Losses'}, axis=1, inplace=True)

In [22]:
game_data.drop(['WTeamID', 'LTeamID'], axis=1, inplace=True)

In [31]:
game_data['T2Losses'].fillna(0, inplace=True)
game_data['T2WinPCT'] = game_data['T2Wins'] / (game_data['T2Wins'] + game_data['T2Losses'])
game_data.head()

Unnamed: 0,Season,T1TeamID,T1Score,T2TeamID,T2Score,Result,T1Seed,T2Seed,SeedDiff,T1Wins,T1Losses,T1WinPCT,T2Wins,T2Losses,T2WinPCT
0,1985,1116,63,1234,54,1,9,8,1,21,12.0,0.636364,20,10.0,0.666667
1,1985,1120,59,1345,58,1,11,6,5,18,11.0,0.62069,17,8.0,0.68
2,1985,1207,68,1250,43,1,1,16,-15,25,2.0,0.925926,11,18.0,0.37931
3,1985,1229,58,1425,55,1,9,8,1,20,7.0,0.740741,19,9.0,0.678571
4,1985,1242,49,1325,38,1,3,14,-11,23,7.0,0.766667,20,7.0,0.740741


**Step 6:** Recreate your training and test sets from the original data source, using the same criteria as before

In [32]:
train = game_data[game_data['Season'] < 2015]
test = game_data[~game_data['Season'] < 2015]

**Step 7:** Recreate `X` and `y`, except this time include the new features that you added -- Wins and losses for each team, as well as their winning percentage

In [33]:
X = game_data[['SeedDiff', 'T1Wins', 'T1Losses', 'T1WinPCT', 'T2Wins', 'T2Losses', 'T2WinPCT']]
y = game_data['Result']

**Step 8:** Re-check your validation scores with the new data, using the same conditions that we did in the previous step.  See if your validation scores improved at all.

In [34]:
cross_scores2 = cross_val_score(rf, X, y, cv=10)
print(np.mean(cross_scores2), cross_scores2)

0.673489529440749 [0.654102   0.64301552 0.73111111 0.69111111 0.63777778 0.67777778
 0.71777778 0.65555556 0.66444444 0.66222222]


Did your results improve?

In [35]:
np.mean(cross_scores2) - np.mean(cross_scores)

-0.03502466768718837

**Step 9:** Of the two different versions of our model that we just tested, take the best one, fit your random forest on its training data, and then score it on your test set to see how your final results come out.

In [40]:
rf.fit(X1, y1)
rf.score(test[['SeedDiff']], test['Result'])

0.7099067081297201

**Step 10:** How close were your validation and test results?  Ie, how reliable were our validation results?

Test scores actually improved when compared to the training scores. The seed diff column wokre better than added features

**Bonus:** If time permits, you can try a few different permutations of what we just did to continue to improve your results.  Including:

 - Trying to add more features beyon each team's winning percentage (perhaps average point differential would be more informative)
 - Using a grid search to find the best parameters of your random forest and seeing how that improves your results

In [41]:
from sklearn.model_selection import GridSearchCV

In [47]:
params1 = {
    'n_estimators': [100, 200, 300, 500, 750],
    'min_samples_leaf': [1, 3, 5, 7]
}

In [48]:
grid1 = GridSearchCV(rf, params1, cv=10)
grid1.fit(X1, y1)

GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rand

In [49]:
grid1.best_params_

{'min_samples_leaf': 5, 'n_estimators': 750}

In [50]:
grid1.best_score_

0.7100800968233246

In [51]:
grid1.fit(X, y)

GridSearchCV(cv=10, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False,
                                              rand

In [52]:
grid1.best_params_

{'min_samples_leaf': 7, 'n_estimators': 500}

In [53]:
grid1.best_score_

0.7043577235772359