# 第3章 用决策树预测获胜球队

本章主要内容有：
- 用pandas库加载、处理数据
- 决策树
- 随机森林
- 对真实数据集进行数据挖掘
- 创建新特征，用强有力的框架对其进行测试

## 3.1 加载数据集

以往很多对体育赛事预测的研究表明，正确率因体育赛事而异，其上限在70%~80%之间。体育赛事预测多采用数据挖掘或统计学方法。

#### 3.1.1 采集数据

#### 3.1.2 用 pandas 加载数据集

In [1]:
import pandas as pd

In [37]:
dataset = pd.read_csv('NBA.csv')
dataset[:5]

Unnamed: 0,Date,Start (ET),Visitor/Neutral,PTS,Home/Neutral,PTS.1,Unnamed: 6,Unnamed: 7,Notes
0,Tue Oct 29 2013,7:00 pm,Orlando Magic,87,Indiana Pacers,97,Box Score,,
1,Tue Oct 29 2013,10:30 pm,Los Angeles Clippers,103,Los Angeles Lakers,116,Box Score,,
2,Tue Oct 29 2013,8:00 pm,Chicago Bulls,95,Miami Heat,107,Box Score,,
3,Wed Oct 30 2013,7:00 pm,Brooklyn Nets,94,Cleveland Cavaliers,98,Box Score,,
4,Wed Oct 30 2013,8:30 pm,Atlanta Hawks,109,Dallas Mavericks,118,Box Score,,


In [38]:
pwd

'C:\\Users\\hasee\\Documents\\Python Scripts\\MyGit\\Machine Learning\\Learning Data Mining with Python\\第3章 用决策树预测获胜球队'

#### 3.1.3 数据集清洗

In [39]:
dataset.dtypes

Date               object
Start (ET)         object
Visitor/Neutral    object
PTS                 int64
Home/Neutral       object
PTS.1               int64
Unnamed: 6         object
Unnamed: 7         object
Notes              object
dtype: object

In [40]:
dataset = pd.read_csv('NBA.csv', parse_dates=["Date"])

In [41]:
dataset = dataset.reindex(columns=['Date','Unnamed: 6','Visitor/Neutral','PTS','Home/Neutral','PTS.1','Unnamed: 7','Notes'])
dataset.columns = ["Date", "Score Type", "Visitor Team", "VisitorPts", "Home Team", "HomePts", "OT?", "Notes"]

In [42]:
dataset.ix[:5]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate_ix
  """Entry point for launching an IPython kernel.


Unnamed: 0,Date,Score Type,Visitor Team,VisitorPts,Home Team,HomePts,OT?,Notes
0,2013-10-29,Box Score,Orlando Magic,87,Indiana Pacers,97,,
1,2013-10-29,Box Score,Los Angeles Clippers,103,Los Angeles Lakers,116,,
2,2013-10-29,Box Score,Chicago Bulls,95,Miami Heat,107,,
3,2013-10-30,Box Score,Brooklyn Nets,94,Cleveland Cavaliers,98,,
4,2013-10-30,Box Score,Atlanta Hawks,109,Dallas Mavericks,118,,
5,2013-10-30,Box Score,Washington Wizards,102,Detroit Pistons,113,,


#### 3.1.4 提取新特征

In [43]:
dataset["HomeWin"] = dataset["VisitorPts"] < dataset["HomePts"]
y_true = dataset["HomeWin"].values

In [44]:
from collections import defaultdict
won_last = defaultdict(bool)

In [45]:
dataset[:100]

Unnamed: 0,Date,Score Type,Visitor Team,VisitorPts,Home Team,HomePts,OT?,Notes,HomeWin
0,2013-10-29,Box Score,Orlando Magic,87,Indiana Pacers,97,,,True
1,2013-10-29,Box Score,Los Angeles Clippers,103,Los Angeles Lakers,116,,,True
2,2013-10-29,Box Score,Chicago Bulls,95,Miami Heat,107,,,True
3,2013-10-30,Box Score,Brooklyn Nets,94,Cleveland Cavaliers,98,,,True
4,2013-10-30,Box Score,Atlanta Hawks,109,Dallas Mavericks,118,,,True
5,2013-10-30,Box Score,Washington Wizards,102,Detroit Pistons,113,,,True
6,2013-10-30,Box Score,Los Angeles Lakers,94,Golden State Warriors,125,,,True
7,2013-10-30,Box Score,Charlotte Bobcats,83,Houston Rockets,96,,,True
8,2013-10-30,Box Score,Orlando Magic,115,Minnesota Timberwolves,120,OT,,True
9,2013-10-30,Box Score,Indiana Pacers,95,New Orleans Pelicans,90,,,False


In [46]:
for i in range(len(dataset)):
    home_team = dataset.at[i, 'Home Team']
    visitor_team = dataset.at[i, 'Visitor Team']
    dataset.at[i, 'HomeLastWin'] = won_last[home_team]
    dataset.at[i, 'VisitorLastWin'] = won_last[visitor_team]
    won_last[home_team] = dataset.at[i, 'HomeWin']
    won_last[visitor_team] = not dataset.at[i, 'HomeWin']

In [47]:
dataset.ix[:100]

Unnamed: 0,Date,Score Type,Visitor Team,VisitorPts,Home Team,HomePts,OT?,Notes,HomeWin,HomeLastWin,VisitorLastWin
0,2013-10-29,Box Score,Orlando Magic,87,Indiana Pacers,97,,,True,False,False
1,2013-10-29,Box Score,Los Angeles Clippers,103,Los Angeles Lakers,116,,,True,False,False
2,2013-10-29,Box Score,Chicago Bulls,95,Miami Heat,107,,,True,False,False
3,2013-10-30,Box Score,Brooklyn Nets,94,Cleveland Cavaliers,98,,,True,False,False
4,2013-10-30,Box Score,Atlanta Hawks,109,Dallas Mavericks,118,,,True,False,False
5,2013-10-30,Box Score,Washington Wizards,102,Detroit Pistons,113,,,True,False,False
6,2013-10-30,Box Score,Los Angeles Lakers,94,Golden State Warriors,125,,,True,False,True
7,2013-10-30,Box Score,Charlotte Bobcats,83,Houston Rockets,96,,,True,False,False
8,2013-10-30,Box Score,Orlando Magic,115,Minnesota Timberwolves,120,OT,,True,False,False
9,2013-10-30,Box Score,Indiana Pacers,95,New Orleans Pelicans,90,,,False,False,True


### 3.2 决策树

![Alt text](./decision_tree.png)

In [48]:
pwd

'C:\\Users\\hasee\\Documents\\Python Scripts\\MyGit\\Machine Learning\\Learning Data Mining with Python\\第3章 用决策树预测获胜球队'

#### 3.2.1 决策树中的参数

scikit-learn库实现的决策树算法给出了退出方法，使用下面这两个选项就可以达到目的。
- min_samples_split：指定创建一个新节点至少需要的个体数量。
- min_samples_leaf：指定为了保留节点，每个节点至少应该包含的个体数量。

第一个参数控制着决策节点的创建，第二个参数决定着决策节点能否被保留。

决策树的另一个参数是创建决策的标准，常用的有以下两个。
- 基尼不纯度（Gini impurity）：用于衡量决策节点错误预测新个体类别的比例。
- 信息增益（Information gain）：用信息论中的熵来表示决策节点提供多少新信息。

#### 3.2.2 使用决策树

In [49]:
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=14)

In [53]:
from sklearn.cross_validation import cross_val_score
import numpy as np

In [54]:
X_previouswins = dataset[["HomeLastWin", "VisitorLastWin"]].values
scores = cross_val_score(clf, X_previouswins, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Accuracy: 57.5%


## 3.3 NBA 比赛结果预测

In [55]:
standings = pd.read_csv("NBA1.csv")

In [56]:
standings

Unnamed: 0,Rk,Team,Overall,Home,Road,E,W,A,C,SE,...,Post,≤3,≥10,Oct,Nov,Dec,Jan,Feb,Mar,Apr
0,1,Miami Heat,66-16,37-4,29-12,41-11,25-5,14-4,12-6,15-1,...,30-2,9-3,39-8,1-0,10-3,10-5,8-5,12-1,17-1,8-1
1,2,Oklahoma City Thunder,60-22,34-7,26-15,21-9,39-13,7-3,8-2,6-4,...,21-8,3-6,44-6,,13-4,11-2,11-5,7-4,12-5,6-2
2,3,San Antonio Spurs,58-24,35-6,23-18,25-5,33-19,8-2,9-1,8-2,...,16-12,9-5,31-10,1-0,12-4,12-4,12-3,8-3,10-4,3-6
3,4,Denver Nuggets,57-25,38-3,19-22,19-11,38-14,5-5,10-0,4-6,...,24-4,11-7,28-8,0-1,8-8,9-6,12-3,8-4,13-2,7-1
4,5,Los Angeles Clippers,56-26,32-9,24-17,21-9,35-17,7-3,8-2,6-4,...,17-9,3-5,38-12,1-0,8-6,16-0,9-7,8-5,7-7,7-1
5,6,Memphis Grizzlies,56-26,32-9,24-17,22-8,34-18,8-2,8-2,6-4,...,23-8,6-4,28-9,0-1,12-1,7-7,10-7,9-2,11-6,7-2
6,7,New York Knicks,54-28,31-10,23-18,37-15,17-13,10-6,12-6,15-3,...,22-10,7-5,31-12,,11-4,10-5,7-6,6-5,12-6,8-2
7,8,Brooklyn Nets,49-33,26-15,23-18,36-16,13-17,11-5,13-5,12-6,...,18-11,9-4,23-17,,11-4,5-11,11-4,7-5,8-7,7-2
8,9,Indiana Pacers,49-32,30-11,19-21,31-20,18-12,6-11,13-3,12-6,...,17-11,4-9,27-14,1-0,7-8,10-5,9-6,9-3,11-5,2-5
9,10,Golden State Warriors,47-35,28-13,19-22,19-11,28-24,7-3,5-5,7-3,...,17-13,5-3,20-18,1-0,8-6,12-4,8-7,4-8,9-7,5-3


In [58]:
dataset["HomeTeamRanksHigher"] = 0
for i in range(len(standings)):
    home_team = dataset.at[i, 'Home Team']
    visitor_team = dataset.at[i, 'Visitor Team']
    if home_team == "New Orleans Pelicans":
        home_team = "New Orleans Hornets"
    elif visitor_team == "New Orleans Pelicans":
        visitor_team = "New Orleans Hornets"
    home_rank = standings[standings["Team"] == home_team]["Rk"].values[0]
    visitor_rank = standings[standings["Team"] == visitor_team]["Rk"].values[0]
    dataset.at[i, 'HomeTeamRanksHigher'] = int(home_rank > visitor_rank)
    


In [59]:
dataset[:5]

Unnamed: 0,Date,Score Type,Visitor Team,VisitorPts,Home Team,HomePts,OT?,Notes,HomeWin,HomeLastWin,VisitorLastWin,HomeTeamRanksHigher
0,2013-10-29,Box Score,Orlando Magic,87,Indiana Pacers,97,,,True,False,False,0
1,2013-10-29,Box Score,Los Angeles Clippers,103,Los Angeles Lakers,116,,,True,False,False,1
2,2013-10-29,Box Score,Chicago Bulls,95,Miami Heat,107,,,True,False,False,0
3,2013-10-30,Box Score,Brooklyn Nets,94,Cleveland Cavaliers,98,,,True,False,False,1
4,2013-10-30,Box Score,Atlanta Hawks,109,Dallas Mavericks,118,,,True,False,False,1


In [60]:
X_homehigher = dataset[["HomeLastWin", "VisitorLastWin","HomeTeamRanksHigher"]].values

In [62]:
clf = DecisionTreeClassifier(random_state=14)
scores = cross_val_score(clf, X_homehigher, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

Accuracy: 57.5%


## 3.4 随机森林

#### 3.4.1 决策树的集成效果如何

#### 3.4.2 随机森林算法的参数

scikit-learn 库中的 RandomForestClassifier 就是对随机森林算法的实现，它提供了一系列参数。因为它使用了 DecisionTreeClassifier 的大量实例，所以它俩的很多参数是一致的，比如决策标准（基尼不纯度/信息增益）、 max_features 和 min_samples_split。
集成过程还引入了一些新参数。
- n_estimators：用来指定创建决策树的数量。该值越高，所花时间越长，正确率（可能）也越高。
- oob_score：如果设置为真，测试时将不使用训练模型时用过的数据。
- n_jobs：采用并行计算方法训练决策树时所用到的内核数量。

scikit-learn 库提供了用于并行计算的 Joblib 库。 n_jobs 指定所用的内核数。默认使用1
个内核——如果CPU是多核的，可以多用几个，或者将其设置为-1，开动全部马力。

#### 3.4.3 使用随机森林算法

In [61]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=14)
scores = cross_val_score(clf, X_teams, y_true, scoring='accuracy')
print("Accuracy: {0:.1f}%".format(np.mean(scores) * 100))

NameError: name 'X_teams' is not defined