# Data Mining CS 619, Spring 2018 - Eleonora Renz

### Week 3 - Chapter 3

## Predicting Sports Winners with Decision Trees

#### Using pandas to load the dataset

In [1]:
import pandas as pd

data_filename = "Data/basketball.csv"
dataset = pd.read_csv(data_filename, sep=";")

In [2]:
dataset.head() # Leaving out the "5" from the book since it is the default value

Unnamed: 0,Date,Start (ET),Visitor/Neutral,PTS,Home/Neutral,PTS.1,Unnamed: 6,Unnamed: 7,Notes
0,Tue Oct 27 2015,8:00 pm,Detroit Pistons,106,Atlanta Hawks,94,Box Score,,
1,Tue Oct 27 2015,8:00 pm,Cleveland Cavaliers,95,Chicago Bulls,97,Box Score,,
2,Tue Oct 27 2015,10:30 pm,New Orleans Pelicans,95,Golden State Warriors,111,Box Score,,
3,Wed Oct 28 2015,7:30 pm,Philadelphia 76ers,95,Boston Celtics,112,Box Score,,
4,Wed Oct 28 2015,7:30 pm,Chicago Bulls,115,Brooklyn Nets,100,Box Score,,


#### Cleaning up the dataset

In [3]:
dataset = pd.read_csv(data_filename, sep=";", parse_dates=["Date"])

dataset.columns = ["Date", "Start (ET)", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT?", "Score Type", "Notes"]

dataset.head()

Unnamed: 0,Date,Start (ET),Visitor Team,VisitorPts,Home Team,HomePts,OT?,Score Type,Notes
0,2015-10-27,8:00 pm,Detroit Pistons,106,Atlanta Hawks,94,Box Score,,
1,2015-10-27,8:00 pm,Cleveland Cavaliers,95,Chicago Bulls,97,Box Score,,
2,2015-10-27,10:30 pm,New Orleans Pelicans,95,Golden State Warriors,111,Box Score,,
3,2015-10-28,7:30 pm,Philadelphia 76ers,95,Boston Celtics,112,Box Score,,
4,2015-10-28,7:30 pm,Chicago Bulls,115,Brooklyn Nets,100,Box Score,,


In [4]:
print(dataset.dtypes)

Date            datetime64[ns]
Start (ET)              object
Visitor Team            object
VisitorPts               int64
Home Team               object
HomePts                  int64
OT?                     object
Score Type              object
Notes                   object
dtype: object


#### Extacting new features

In [5]:
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]
y_true = dataset["HomeWin"].values

As a baseline accuracy we can look at the accuracy of home wins, because in nearly all sports the home team has an advantage in games.<br>Our baseline accuracy is:

In [6]:
dataset["HomeWin"].mean()

0.5942249240121581

In [7]:
from collections import defaultdict

# Creating new fatures
won_last = defaultdict(int)

# Initialize dictionaries 
dataset["HomeLastWin"] = 0
dataset["VisitorLastWin"] = 0

for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    row["Home Team"] = won_last[home_team]
    dataset.set_value(index, "HomeLastWin", won_last[home_team])
    dataset.set_value(index, "VisitorLastWin", won_last[visitor_team])
    won_last[home_team] = int(row["HomeWin"])
    won_last[visitor_team] = 1 - int(row["HomeWin"])

In [8]:
dataset.head(6)

Unnamed: 0,Date,Start (ET),Visitor Team,VisitorPts,Home Team,HomePts,OT?,Score Type,Notes,HomeWin,HomeLastWin,VisitorLastWin
0,2015-10-27,8:00 pm,Detroit Pistons,106,Atlanta Hawks,94,Box Score,,,False,0,0
1,2015-10-27,8:00 pm,Cleveland Cavaliers,95,Chicago Bulls,97,Box Score,,,True,0,0
2,2015-10-27,10:30 pm,New Orleans Pelicans,95,Golden State Warriors,111,Box Score,,,True,0,0
3,2015-10-28,7:30 pm,Philadelphia 76ers,95,Boston Celtics,112,Box Score,,,True,0,0
4,2015-10-28,7:30 pm,Chicago Bulls,115,Brooklyn Nets,100,Box Score,,,False,0,1
5,2015-10-28,7:30 pm,Utah Jazz,87,Detroit Pistons,92,Box Score,,,True,1,0


In [9]:
dataset.iloc[1000:1005] # Using iloc instead of ix as it is recommended over ix

Unnamed: 0,Date,Start (ET),Visitor Team,VisitorPts,Home Team,HomePts,OT?,Score Type,Notes,HomeWin,HomeLastWin,VisitorLastWin
1000,2016-03-15,7:00 pm,Denver Nuggets,110,Orlando Magic,116,Box Score,,,True,0,0
1001,2016-03-15,8:30 pm,Los Angeles Clippers,87,San Antonio Spurs,108,Box Score,,,True,1,0
1002,2016-03-16,7:00 pm,Oklahoma City Thunder,130,Boston Celtics,109,Box Score,,,False,0,1
1003,2016-03-16,7:00 pm,Orlando Magic,99,Charlotte Hornets,107,Box Score,,,True,0,1
1004,2016-03-16,7:00 pm,Dallas Mavericks,98,Cleveland Cavaliers,99,Box Score,,,True,0,1


### Quick Book Definitions:

<b>Decision trees</b> are a class of supervised learning algorithms like a flow
chart that consists of a sequence of nodes, where the values for a sample
are used to make a decision on the next node to go to.
<br>Decision trees, like most classification methods, are <i>eager
learners</i>, undertaking work at the training stage and therefore needing to do less
in the predicting stage.<br>
One of the most important parameters for a Decision Tree is the stopping criterion. When
the tree building is nearly completed, the final few decisions can often be somewhat
arbitrary and rely on only a small number of samples to make their decision. Using such
specific nodes can result in trees that significantly overfit the training data. Instead, a
stopping criterion can be used to ensure that the Decision Tree does not reach this
exactness.
Instead of using a stopping criterion, the tree could be created in full and then trimmed.
This trimming process removes nodes that do not provide much information to the overall
process. This is known as pruning and results in a model that generally does better on new
datasets because it hasn't overfitted the training data.
<ul><li><i>min_samples_split</i>: This specifies how many samples are needed in order to
create a new node in the Decision Tree
<li><i>min_samples_leaf</i>: This specifies how many samples must be resulting from a
node for it to stay</li>
<li><i>Gini impurity</i>: This is a measure of how often a decision node would incorrectly
predict a sample's class</li>
<li><i>Information gain</i>: This uses information-theory-based entropy to indicate how
much extra information is gained by the decision node</li></ul>

In [10]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state = 14)

x_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values

from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(clf, x_previouswins, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Accuracy: 59.4%


#### Putting it all together

In [11]:
import os 

data_folder = "Data/"
standings_filename = os.path.join(data_folder, "standings.csv")

standings = pd.read_csv(standings_filename, skiprows = 1, sep = ";")

standings.head()

Unnamed: 0,Rk,Team,Overall,Home,Road,E,W,A,C,SE,...,Post,?3,?10,Oct,Nov,Dec,Jan,Feb,Mar,Apr
0,1,Golden State Warriors,67-15,39-2,28-13,25 May,42-10,9 Jan,7 Mar,9 Jan,...,25 Jun,5 Mar,45-9,1-0,13 Feb,11 Mar,12 Mar,8 Mar,16 Feb,6 Feb
1,2,Atlanta Hawks,60-22,35-6,25-16,38-14,22 Aug,12 Jun,14 Apr,12 Apr,...,17 Nov,6 Apr,30 Oct,0-1,9 May,14 Feb,17-0,7 Apr,9 Jul,4 Mar
2,3,Houston Rockets,56-26,30 Nov,26-15,23 Jul,33-19,9 Jan,8 Feb,6 Apr,...,20 Sep,8 Apr,31-14,2-0,11 Apr,9 May,11 Jun,7 Mar,10 Jun,6 Feb
3,4,Los Angeles Clippers,56-26,30 Nov,26-15,19 Nov,37-15,7 Mar,6 Apr,6 Apr,...,21 Jul,3 May,33-9,2-0,9 May,11 Jun,11 Apr,5 Jun,11 May,7-0
4,5,Memphis Grizzlies,55-27,31 Oct,24-17,20 Oct,35-17,8 Feb,5 May,7 Mar,...,16-13,9 Mar,26-13,2-0,13 Feb,8 Jun,12 Apr,7 Apr,9 Aug,4 Mar


In [12]:
# Create new feature
dataset["HomeTeamRanksHigher"] = 0
for index, row in dataset.iterrows():
    home_team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]
    row["HomeTeamRankHigher"] = int(home_rank > visitor_rank)
    dataset.set_value(index, "HomeTeamRanksHigher", int(home_rank < visitor_rank))

In [13]:
# Test the results
x_homehigher = dataset[["HomeLastWin", "VisitorLastWin", "HomeTeamRanksHigher"]].values

# Create DecisionTree and run evaluation
clf = DecisionTreeClassifier(random_state = 14)
scores = cross_val_score(clf, x_homehigher, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Accuracy: 60.9%


Now we will look at which team won in the last match where they were facing eachother:

In [14]:
last_match_winner = defaultdict(int)
dataset["HomeTeamWonLast"] = 0

for index, row in dataset.iterrows():
    home_Team = row["Home Team"]
    visitor_team = row["Visitor Team"]
    teams = tuple(sorted([home_team, visitor_team])) # Sort for a consistent ordering
    # Set in the row, who won the last encounter
    
    home_team_won_last = 1 if last_match_winner[teams] == row["Home Team"] else 0
    dataset.set_value(index, "HomeTeamWonLast", home_team_won_last)
    # Who won this one?
    winner = row["Home Team"] if row["HomeWin"] else row["Visitor Team"]
    last_match_winner[teams] = winner
    
# Evaluate 
x_lastwinner = dataset[["HomeTeamWonLast", "HomeTeamRanksHigher", "HomeLastWin", "VisitorLastWin",]].values
clf = DecisionTreeClassifier(random_state = 14, criterion = "entropy")

scores = cross_val_score(clf, x_lastwinner, y_true, scoring="accuracy")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Accuracy: 60.5%


#### Label Encoder

In [15]:
# Transform Strings into assigned integer values
from sklearn.preprocessing import LabelEncoder
encoding = LabelEncoder()
encoding.fit(dataset["Home Team"].values)
home_teams = encoding.transform(dataset["Home Team"].values)
visitor_teams = encoding.transform(dataset["Visitor Team"].values)
x_teams = np.vstack([home_teams, visitor_teams]).T

#### One Hot Encoder

In [16]:
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder()
x_teams = onehot.fit_transform(x_teams).todense()

In [17]:
clf = DecisionTreeClassifier(random_state = 14)
scores = cross_val_score(clf, x_teams, y_true, scoring="accuracy")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Accuracy: 62.8%


### Random Forests

We have randomly built trees using randomly chosen samples, using (nearly)
randomly chosen features. This is a <b>random forest</b> and, perhaps unintuitively, this
algorithm is very effective for many datasets, with little need to tune many parameters of
the model.
<b>Bagging</b> is choosing a random subsample of our dataset, effectively creating new training sets.

### How does ensamble work?

<b>Variance</b> is the error introduced by variations in the training dataset on
the algorithm. Algorithms with a high variance (such as decision trees) can
be greatly affected by variations to the training dataset. This results in
models that have the problem of overfitting. <br>In contrast, <b>bias</b> is the error
introduced by assumptions in the algorithm rather than anything to do
with the dataset, that is, if we had an algorithm that presumed that all
features would be normally distributed, then our algorithm may have a
high error if the features were not.

### Applying random forests

In [18]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state = 14)
scores = cross_val_score(clf, x_teams, y_true, scoring="accuracy")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Accuracy: 65.3%


In [None]:
x_all = np.hstack([x_lastwinner, x_teams])
clf = RandomForestClassifier(random_state = 14)
scores = cross_val_score(clf, x_all, y_true, scoring = "accuracy")
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Accuracy: 61.2%


In [None]:
from sklearn.model_selection import GridSearchCV

parameter_space = {
    "max_features": [2, 10, 'auto'],
    "n_estimators": [100, 200],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [2,4,6],
}

clf = RandomForestClassifier(random_state = 14)
grid = GridSearchCV(clf, parameter_space)
grid.fit(x_all, y_true)
print("Accuracy: {0:.1f}%".format(grid.best_score_ * 100))

In [None]:
print(grid.best_estimator_)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion="entropy", max_depth=None, max_features=2, max_leaf_nodes=None, min_samples_leaf=2, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1, oob_score=False, random_state=14, verbose=0, warm_start=False)