In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact, fixed

# Machine Learning
In this notebook we explore several machine learning algorithms and look for one that fits our data accurately enough to predict the results of future NCAA D1 basketball matchups.

## Load The Data Set
Here we will load the data set that was created in the modeling section of this project.

In [2]:
machine_learning_dataset = pd.read_pickle("machine_learning_dataset")

In [3]:
machine_learning_dataset.head()

Unnamed: 0,Wfgp,Wfgp3,Wdr,Wast,Wto,Wpf,Lfgp,Lfgp3,Ldr,Last,Lto,Lpf,Win
46253,0.454545,0.3125,26,15,12,16,0.339286,0.066667,22,9,15,23,True
62629,0.42,0.307692,32,13,19,16,0.338462,0.136364,23,5,12,19,True
21495,0.510638,0.533333,21,19,14,14,0.333333,0.222222,12,11,13,25,True
55532,0.451613,0.333333,25,11,8,23,0.363636,0.25,29,11,13,15,True
42770,0.48,0.473684,23,14,9,14,0.5,0.4,22,11,11,19,True


In [4]:
assert len(machine_learning_dataset) == 66719
assert not machine_learning_dataset.isnull().values.any()

## Create a New Train/Test Split In The Data
In this section we create another train/test split so that we can measure the accuracy of different classification models on our data. This time we split 75/25 instead of 50/50, which leaves more data for training while still holding out enough games to test the models.

In [5]:
# column setup for the feature columns and target column
feature_columns = list(machine_learning_dataset.columns[:-1])
target_column = list(machine_learning_dataset.columns[-1:])

In [6]:
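# sklearn.cross_validation was removed in scikit-learn 0.20;
# train_test_split now lives in sklearn.model_selection
from sklearn import model_selection

In [7]:
X = machine_learning_dataset[feature_columns]
y = machine_learning_dataset[target_column[0]]

In [8]:
# test_size=0.25 is the default; it is spelled out here to match the 75/25 split described above
Xtrain, Xtest, ytrain, ytest = model_selection.train_test_split(X, y, test_size=0.25)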

In [9]:
print(Xtrain.shape)
print(ytrain.shape)

(50039, 12)
(50039,)


In [10]:
print(Xtest.shape)
print(ytest.shape)

(16680, 12)
(16680,)
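The 50,039 / 16,680 row counts above follow directly from a 25% test fraction on 66,719 rows. The following is a minimal sketch that reproduces the arithmetic with a stand-in array of the same length rather than the real dataset. The names `X_demo` and `y_demo` are hypothetical, and `random_state=0` is an assumption added here so the split is repeatable; the notebook's own split is unseeded.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the real dataset: only the number of rows matters here.
n_rows = 66719
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows) % 2 == 0  # alternating True/False target

# test_size=0.25 holds out a quarter of the rows (rounded up).
# random_state is an assumption added so the split is repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

print(X_tr.shape, X_te.shape)
```

Seeding the split this way would also make the accuracy scores later in the notebook comparable from run to run.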


# Choosing A Machine Learning Model
In this section we explore several machine learning models to find the classifier best suited to our data. Some classifiers suit a given problem better than others, and we want the most accurate one for our dataset.

## Logistic Regression
Logistic regression was the first model I decided to look at. scikit-learn's implementation applies L2 regularization by default, which helps limit overfitting, and the model generally does a good job of making predictions on data like ours. Let's analyze its accuracy on our dataset.

In [11]:
from sklearn import linear_model

This cell creates our logistic regression model. I chose "newton-cg" as the solver mainly because my dataset is fairly large and I expected this solver to perform well on it. I also experimented with other parameters to nudge the accuracy up slightly; the call shown here sets only the solver.

In [12]:
logistic_regression_model = linear_model.LogisticRegression(solver="newton-cg")
logistic_regression_model.fit(Xtrain, ytrain)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='newton-cg', tol=0.0001,
          verbose=0, warm_start=False)

Here we will measure the accuracy of our model:

In [13]:
logistic_regression_model.score(Xtest, ytest)

0.93201438848920859

Wow! A few things could be going on here. That number is unusually high for a predictor, and a score this high often stems from skewed data. More analysis of the data is needed to decide whether this score really reflects how well the logistic model will predict games.
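One way to start that analysis is to compare the score against a trivial baseline and to check that it holds up across several splits. The sketch below uses synthetic stand-in data from `make_classification` rather than the real `X` and `y`, and all variable names are hypothetical; it only illustrates the kind of checks that could be run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; the notebook would use X and y instead.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=6, random_state=0
)

# 1. Class balance: a heavily lopsided target can inflate accuracy.
balance = np.bincount(y_demo) / len(y_demo)

# 2. Baseline: always predict the most frequent class.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_score = baseline.score(X_te, y_te)

# 3. 5-fold cross-validation: does the score depend on one lucky split?
cv_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=5
)

print("class balance:", balance)
print("baseline accuracy:", baseline_score)
print("cross-validated accuracy:", cv_scores.mean())
```

If the real model clears the baseline by a wide margin on every fold, the next thing to question is the features themselves rather than the split.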

## Random Forest Classifier
This classifier fits many randomized decision trees to the data. Like logistic regression, it does a good job of handling overfitting: each tree is fit to a random sub-sample of the data, and combining the trees increases accuracy and keeps any single tree from overfitting.

In [14]:
from sklearn import ensemble

In my experiments with this classifier, adding more trees as estimators made the model more accurate.

In [15]:
forest_model = ensemble.RandomForestClassifier(n_estimators=50)
forest_model.fit(Xtrain, ytrain)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [16]:
forest_model.score(Xtest, ytest)

0.90587529976019188

This classifier is also very accurate, just as logistic regression is. Again, this could be because of correlation among the features. An analysis of feature weights may be used later to investigate this.
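A random forest exposes one such set of weights through its `feature_importances_` attribute. The sketch below is an illustration only: it reuses the notebook's column names but fills them with synthetic values from `make_classification`, so the ranking it prints says nothing about the real data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Column names taken from the dataset above; the VALUES are synthetic.
cols = ["Wfgp", "Wfgp3", "Wdr", "Wast", "Wto", "Wpf",
        "Lfgp", "Lfgp3", "Ldr", "Last", "Lto", "Lpf"]
X_arr, y_demo = make_classification(
    n_samples=500, n_features=12, n_informative=4, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=cols)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one per feature, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook, the same two lines after `fit` could be applied to `forest_model` with `feature_columns` as the index.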

## Decision Tree Classifier
This model uses what are called decision trees to fit data. A decision tree learns a series of "if-statements" about the features and uses them to make predictions about new data. It may not work well for us because of the correlation between features in our data: some features dominate, which could produce biased decision trees.

In [17]:
from sklearn import tree

In [18]:
decision_tree_model = tree.DecisionTreeClassifier()
decision_tree_model.fit(Xtrain, ytrain)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

In [19]:
decision_tree_model.score(Xtest, ytest)

0.83842925659472423

This is a pretty accurate model. Let's analyze more and see what we can find.
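The "if-statements" a decision tree learns can be printed with `sklearn.tree.export_text`. The following sketch fits a deliberately shallow tree to synthetic data so the rules stay readable. The feature names `f0` to `f3` and the `max_depth=2` limit are assumptions made for the illustration; the notebook's tree is grown without a depth limit.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic problem with four hypothetical features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2,
    n_redundant=0, random_state=0
)

# max_depth=2 keeps the printed rules short.
tree_demo = DecisionTreeClassifier(max_depth=2, random_state=0)
tree_demo.fit(X_demo, y_demo)

# Each indented line is one threshold test on a feature.
rules = export_text(tree_demo, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```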

## Bernoulli Classifier
The Bernoulli naive Bayes classifier is designed for binary features. Our features are continuous, and with the default `binarize=0.0` every positive value is mapped to 1, so I do not expect it to be able to distinguish the features that correlate with our True/False winning target.

In [20]:
from sklearn import naive_bayes

In [21]:
bernoulli_model = naive_bayes.BernoulliNB()
bernoulli_model.fit(Xtrain, ytrain)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [22]:
bernoulli_model.score(Xtest, ytest)

0.50125899280575537

This model is not accurate at all. The target in this dataset is True about 50% of the time and False about 50% of the time, so an accuracy of roughly 50% is no better than guessing, and this model does not help us predict team matchups.
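The near-chance score is consistent with what the default `binarize=0.0` threshold does to features that are all positive. A minimal sketch, using the first two rows shown by `head()` earlier in this notebook and plain NumPy to mimic the thresholding step:

```python
import numpy as np

# First two feature rows from machine_learning_dataset.head() above.
# Columns: Wfgp, Wfgp3, Wdr, Wast, Wto, Wpf, Lfgp, Lfgp3, Ldr, Last, Lto, Lpf
rows = np.array([
    [0.454545, 0.3125, 26, 15, 12, 16, 0.339286, 0.066667, 22, 9, 15, 23],
    [0.42, 0.307692, 32, 13, 19, 16, 0.338462, 0.136364, 23, 5, 12, 19],
])

# BernoulliNB(binarize=0.0) treats every value above 0.0 as 1.
binarized = (rows > 0.0).astype(int)
print(binarized)
```

After thresholding, both games look identical to the model, so there is very little left for it to learn from.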

## Gaussian Classifier
The Gaussian naive Bayes classifier assumes that, within each class, the likelihood of each feature is normally distributed.

In [23]:
gaussian_model = naive_bayes.GaussianNB()
gaussian_model.fit(Xtrain, ytrain)

GaussianNB()

In [24]:
gaussian_model.score(Xtest, ytest)

0.87410071942446044

Surprisingly this model is fairly accurate. Let's analyze another.

## K Nearest Neighbors Classifier
This classifier works through a voting procedure. It finds the "K" training points nearest to the point being predicted and assigns the classification that wins the vote among those neighbors. This model is probably very accurate and can be tuned by trying different values of "K".

In [25]:
from sklearn import neighbors

In [26]:
knn_model = neighbors.KNeighborsClassifier(n_neighbors=15)
knn_model.fit(Xtrain, ytrain)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=15, p=2,
           weights='uniform')

In [27]:
knn_model.score(Xtest, ytest)

0.89754196642685846

In my experiments, the more neighbors I gave the classifier, the more accurate the model became. This is also a fairly accurate model compared with the others.
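The effect of "K" can be checked systematically by scoring several values with cross-validation. This is a sketch on synthetic stand-in data; the candidate values in `k_values` are arbitrary choices for the illustration, not the values tried in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the notebook would use Xtrain and ytrain.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=12, n_informative=6, random_state=0
)

# Candidate neighbor counts (arbitrary values for the illustration).
k_values = [1, 5, 15, 31]

mean_scores = {}
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds for this value of K.
    mean_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best K:", best_k)
```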

# Create And Use Our Prediction Function
Now that we have some models that can predict games, let's compare them! We will run the same prediction function with each model and compare the predictions that come out.

In [28]:
import ncaa_helper as nh

In [29]:
teams = pd.read_pickle("teams")
season_data_2016 = pd.read_pickle("new_season_detailed_results")

In [30]:
team_season_data = nh.calc_year_data(2016, season_data_2016, teams)

In [31]:
team_season_data[team_season_data.Team_Name == "Cal Poly SLO"]

Team_Id,Team_Name,Season,wp,ppg,fgp,ftp,fgp3,or,dr,ast,to,stl,blk,pf
1142,Cal Poly SLO,2016,0.285714,71.964286,0.413341,0.681895,0.35443,6.107143,15.357143,13.857143,10.428571,2.714286,3.214286,20.75


### The Prediction Function
The following function is a copy of what is in the **ncaa_helper.py** file. It takes two team names, the seasonal data for the year we would like to analyze, and a classification model, and outputs the probability that each team wins the matchup.

In [32]:
# This function is not used here in this notebook!!!
def predict_game_outcome(team1, team2, season_data, model):
    output = ""
    feature_cols = nh.feature_columns
    
    team1_stats = list(map(list, season_data[season_data.Team_Name == team1][feature_cols].values))
    team2_stats = list(map(list, season_data[season_data.Team_Name == team2][feature_cols].values))
    
    if len(team1_stats) == 0 or len(team2_stats) == 0:
        return "Error: One of the teams you entered does not exist"
    
    team1_stats = team1_stats[0]
    team2_stats = team2_stats[0]
    
    probs = model.predict_proba([team1_stats + team2_stats])
    output += "There is a " + str(probs[0][1] * 100) + "% chance that " + team1 + " will win this game.\n"
    output += "There is a " + str(probs[0][0] * 100) + "% chance that " + team2 + " will win this game.\n" 
    
    return output
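The function reads `probs[0][1]` as the first team's win probability, which relies on the column order of `predict_proba`. scikit-learn orders those columns by `model.classes_`, and for a boolean target that order is `[False, True]`. The sketch below checks this on a tiny made-up dataset; the single feature and its values are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up dataset: one feature, boolean target.
X_demo = np.array([[0.30], [0.35], [0.40], [0.50], [0.55], [0.60]])
y_demo = np.array([False, False, False, True, True, True])

model = LogisticRegression().fit(X_demo, y_demo)

# Columns of predict_proba follow the order of classes_.
print(model.classes_)

probs = model.predict_proba([[0.58]])

# Look the column up by label instead of hard-coding index 1.
true_col = list(model.classes_).index(True)
p_true = probs[0][true_col]
print(p_true)
```

Looking the column up by label, as in the last lines, is a small safeguard in case the target encoding ever changes.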

### Testing The Function
In this section we will be testing the prediction function to see some potential matchups among teams this year using a machine learning model.

In [33]:
print(nh.predict_game_outcome("Cal Poly SLO", "Air Force", team_season_data, logistic_regression_model))

There is a 23.5924902075% chance that Cal Poly SLO will win this game.
There is a 76.4075097925% chance that Air Force will win this game.



In [34]:
print(nh.predict_game_outcome("Cal Poly SLO", "Air Force", team_season_data, forest_model))

There is a 36.0% chance that Cal Poly SLO will win this game.
There is a 64.0% chance that Air Force will win this game.



In [35]:
print(nh.predict_game_outcome("Cal Poly SLO", "Air Force", team_season_data, decision_tree_model))

There is a 100.0% chance that Cal Poly SLO will win this game.
There is a 0.0% chance that Air Force will win this game.



In [36]:
print(nh.predict_game_outcome("Cal Poly SLO", "Air Force", team_season_data, bernoulli_model))

There is a 49.9406212572% chance that Cal Poly SLO will win this game.
There is a 50.0593787428% chance that Air Force will win this game.



In [37]:
print(nh.predict_game_outcome("Cal Poly SLO", "Air Force", team_season_data, gaussian_model))

There is a 27.0824493178% chance that Cal Poly SLO will win this game.
There is a 72.9175506822% chance that Air Force will win this game.



In [38]:
print(nh.predict_game_outcome("Cal Poly SLO", "Air Force", team_season_data, knn_model))

There is a 20.0% chance that Cal Poly SLO will win this game.
There is a 80.0% chance that Air Force will win this game.



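Comparing the predictions for the Cal Poly SLO vs. Air Force matchup, it is apparent that the decision tree and Bernoulli models give skewed results: the decision tree is 100% certain of a Cal Poly SLO win, which every other model considers unlikely, and the Bernoulli model is stuck at roughly 50/50.

The other models broadly agree with one another. The Gaussian and logistic models produce the most fine-grained probabilities, whereas the other models appear to round their predictions. That rounding is consistent with how they work: the K nearest neighbors output is a vote fraction over 15 neighbors (20% is 3 of 15), and the random forest output is an average over 50 trees (36% is 18 of 50).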
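The coarse probabilities from the vote-based models can be reproduced on synthetic data. In this sketch, which uses stand-in data and hypothetical variable names, a 15-neighbor classifier only ever returns multiples of 1/15, and a fully grown decision tree returns exactly 0 or 1 for points it was trained on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 12 features, like the notebook's dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=12, n_informative=6, random_state=0
)
X_new = X_demo[:5]

# KNN: each probability is (votes for the class) / 15.
knn = KNeighborsClassifier(n_neighbors=15).fit(X_demo, y_demo)
knn_votes = knn.predict_proba(X_new)[:, 1] * 15

# Unlimited-depth tree: leaves are pure, so probabilities are 0 or 1.
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
tree_probs = tree_demo.predict_proba(X_new)[:, 1]

print("KNN votes out of 15:", knn_votes)
print("tree probabilities:", tree_probs)
```

Limiting the tree's depth or raising `min_samples_leaf` would be one way to get less extreme probabilities from it.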