## Decision Trees 2:

## 1. Classification trees and regression trees:

<img src="image_github/regression_tree.png" width="600" height="500">

Decision trees can be used not only for classification problems but also for regression problems.

## 2. Four splitting criteria for decision trees:

> 1. Information gain
> 2. Variance
> 3. Gini index
> 4. Information gain ratio

### 2.1 Information gain (IG) - decision trees:

>$$
\text{IG}(D, A) = \text{H}(D) - \text{H}(D|A)
$$

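>**For a given dataset, H(D) is fixed. The smaller the conditional entropy H(D|A), that is, the less uncertainty remains about the class once feature A is known, the larger the information gain and the better the feature performs as a split.**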
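
The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than a library implementation: the function names `entropy` and `information_gain` and the toy `outlook`/`play` data are assumptions made for this example, and the feature is assumed to be categorical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(D), in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(D, A) = H(D) - H(D|A) for one categorical feature A."""
    n = len(labels)
    # Group the labels by the value that feature A takes.
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    # H(D|A): entropy of each group, weighted by the group's share of D.
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Hypothetical toy data: the feature separates the classes perfectly.
outlook = ["sunny", "sunny", "rain", "rain"]
play = ["no", "no", "yes", "yes"]
print(information_gain(outlook, play))  # 1.0
```

A feature that carries no information about the class, such as `["a", "b", "a", "b"]` for the same labels, gives an information gain of 0.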

### 2.2 Variance:

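>Just as ID3 uses entropy to build a decision tree, **a regression tree is built by choosing, at every step, the split with the smallest variance (variance plays the role of the impurity measure).**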

><img src="image_github/CART.png" width="500" height="500">

For example:

<img src="image_github/CART_2.png" width="400" height="500">
<img src="image_github/CART_1.png" width="550" height="500">
<img src="image_github/CART_3.png" width="550" height="500">
<img src="image_github/CART_4.png" width="550" height="500">
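
As a rough sketch of the minimum-variance rule, the snippet below searches the thresholds of a single numeric feature and keeps the one whose two children have the lowest weighted variance. The helper names `variance` and `best_split`, the use of midpoints as candidate thresholds, and the toy data are assumptions for illustration, not taken from the figures.

```python
def variance(ys):
    """Population variance of a list of target values (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def best_split(xs, ys):
    """Return (threshold, weighted_variance) of the best split x <= threshold."""
    n = len(ys)
    # Start from "no split": the variance of the whole node.
    best = (None, variance(ys))
    values = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        score = len(left) / n * variance(left) + len(right) / n * variance(right)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data with two clearly separated groups.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
print(best_split(xs, ys))  # (6.5, 0.0)
```

Each resulting leaf would then predict the mean of the target values that fall into it.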

### 2.3 Gini index - CART (classification and regression trees):

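>**The Gini index is the probability that a sample chosen at random from the subset is misclassified. Note: the Gini index is not the same as the Gini coefficient.**

<img src="image_github/gini_index.png" width="300" height="400">

CART builds a strictly binary tree: every split divides a node into exactly two children.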
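
A short sketch of the computation, assuming the usual definition Gini(D) = 1 - Σ p_k², where p_k is the share of class k in the subset. The function names `gini` and `gini_split` are chosen for this example only.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini index of a binary split, as CART would score it."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

print(gini(["a", "a", "b", "b"]))  # 0.5
print(gini(["a", "a", "a", "a"]))  # 0.0
print(gini_split(["a", "a"], ["b", "b"]))  # 0.0
```

A pure subset scores 0, and a two-class subset split evenly scores 0.5, so a lower value means a better split.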

### 2.4 Information gain ratio:

><img src="image_github/IGR.png" width="400" height="400">

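>**Information gain is biased toward features that take many distinct values (for example a season feature: winter, summer, ...). By the entropy formula, however, the more values a feature takes, the larger its own entropy, so dividing by the entropy of feature A offsets exactly this complexity of the feature.**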
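
The effect can be checked on a toy example. The sketch below assumes the usual definition, gain ratio = IG(D, A) / H_A(D), where H_A(D) is the entropy of the feature's own values; the names `gain_ratio`, `row_id`, and `season` are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def gain_ratio(feature_values, labels):
    """Information gain of feature A divided by the entropy of A itself."""
    n = len(labels)
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    info_gain = entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy(feature_values)
    # A feature with a single value has zero split information; report 0.
    return info_gain / split_info if split_info > 0 else 0.0

labels = ["no", "no", "yes", "yes"]
row_id = [1, 2, 3, 4]          # unique per row: IG = 1.0, split info = 2.0
season = ["w", "w", "s", "s"]  # two values:     IG = 1.0, split info = 1.0
print(gain_ratio(row_id, labels), gain_ratio(season, labels))  # 0.5 1.0
```

Both features have the same information gain, but the gain ratio penalizes the one that simply enumerates the rows.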

## 3. Overfitting:

On a small dataset, the deeper the tree, the more likely it is to overfit.

### 3.1 Pre-pruning:

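Pre-pruning "prunes" the tree by halting its construction early. Once construction stops at a node, that node becomes a leaf, and the leaf can hold the most frequent class among the tuples in its subset.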
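
A stopping rule of this kind might look like the sketch below. The thresholds `max_depth` and `min_samples` and both function names are assumptions for illustration; real libraries expose similar but not identical parameters.

```python
from collections import Counter

def should_stop(labels, depth, max_depth=5, min_samples=2):
    """Pre-pruning check: stop when the node is too deep, too small, or pure."""
    return depth >= max_depth or len(labels) < min_samples or len(set(labels)) == 1

def leaf_label(labels):
    """Label a leaf with the most frequent class among its samples."""
    return Counter(labels).most_common(1)[0][0]

print(should_stop(["a", "b", "a"], depth=5))  # True: depth limit reached
print(should_stop(["a", "b", "a"], depth=1))  # False: keep splitting
print(leaf_label(["a", "b", "a"]))            # a
```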

### 3.2 Post-pruning:

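Post-pruning first builds the complete decision tree, allowing it to overfit the training data, and then replaces every subtree whose confidence is insufficient with a leaf node. The leaf is labeled with the most frequent class in the subtree it replaces.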

<img src="image_github/post_pruning.png" width="400" height="400">
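
One concrete way to decide which subtrees to replace is reduced-error pruning on a held-out set. It is used here as a stand-in for the confidence test described above, not as the method shown in the figure. The sketch assumes a tree stored as nested dicts, where each internal node has hypothetical keys `feature`, `threshold`, `majority` (its most frequent training class), `left`, and `right`, and each leaf has only `label`.

```python
def predict(node, x):
    """Follow binary splits until a leaf (a node without 'feature') is reached."""
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

def errors(node, rows):
    """Number of (x, y) rows that the subtree rooted at node misclassifies."""
    return sum(predict(node, x) != y for x, y in rows)

def prune(node, rows):
    """Bottom-up pruning using the held-out rows that reach this node."""
    if "feature" not in node:
        return node
    left_rows = [(x, y) for x, y in rows if x[node["feature"]] <= node["threshold"]]
    right_rows = [(x, y) for x, y in rows if x[node["feature"]] > node["threshold"]]
    node["left"] = prune(node["left"], left_rows)
    node["right"] = prune(node["right"], right_rows)
    leaf = {"label": node["majority"]}
    # Replace the subtree with a leaf if the leaf makes no more errors.
    return leaf if errors(leaf, rows) <= errors(node, rows) else node

# Hypothetical fully grown tree and held-out rows.
tree = {
    "feature": 0, "threshold": 5, "majority": "a",
    "left": {"label": "a"},
    "right": {
        "feature": 0, "threshold": 8, "majority": "b",
        "left": {"label": "b"},
        "right": {"label": "a"},
    },
}
validation = [([1], "a"), ([6], "b"), ([9], "b")]
pruned = prune(tree, validation)
print(pruned["right"])  # {'label': 'b'}
```

The right subtree collapses into a single leaf because the leaf classifies the held-out rows at least as well, while the root split is kept because removing it would add errors.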

## 4. Model ensembles:

https://blog.csdn.net/sinat_26917383/article/details/54667077

<img src="image_github/model_ensemble.png" width="700" height="500">

### 4.1 Boosting:

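Boosting is an iterative method. Each round of training pays more attention to the samples that were misclassified by giving them larger weights, so that the next round concentrates on getting the previous round's mistakes right. In the end the weak classifiers are combined as a weighted sum. (Boosting works by adjusting the weights of the samples in the dataset.)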

<img src="image_github/boosting_1.png" width="500" height="500">
<img src="image_github/boosting_2.png" width="500" height="500">
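
The reweighting step can be made concrete with the AdaBoost update for binary classification, used here as one well-known instance of boosting. The function name `reweight` and the toy weights are assumptions for this example, and the formula requires the weighted error to lie strictly between 0 and 1.

```python
from math import exp, log

def reweight(weights, correct):
    """One AdaBoost-style round: return (alpha, new_weights).

    weights: current sample weights, summing to 1.
    correct: booleans, True where the weak classifier was right.
    """
    # Weighted error of the weak classifier.
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    # Weight of this classifier in the final sum (assumes 0 < err < 1).
    alpha = 0.5 * log((1 - err) / err)
    # Shrink the weights of correct samples, grow those of misclassified ones.
    scaled = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(scaled)
    return alpha, [w / total for w in scaled]

weights = [0.25, 0.25, 0.25, 0.25]
alpha, new_weights = reweight(weights, [True, True, True, False])
print(round(alpha, 3), [round(w, 3) for w in new_weights])
# 0.549 [0.167, 0.167, 0.167, 0.5]
```

After one round the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.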


### 4.2 Bagging:

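Bagging draws samples with replacement, trains a sub-model on each drawn sample, repeats this process many times, and finally combines the sub-models (for example, random forest).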

* bootstrap samples (random samples): subsets of the dataset selected at random.

<img src="image_github/bagging.png" width="500" height="500">
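
A minimal sketch of the two halves of bagging, sampling and combining, is given below. The names `bootstrap_sample` and `bagged_predict` are hypothetical, the sub-models are stand-in functions rather than trained trees, and majority voting is assumed for classification (a regression ensemble would average instead).

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def bagged_predict(models, x):
    """Majority vote over the predictions of all sub-models."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))
sample = bootstrap_sample(rows, rng)
# The sample has the original size and contains only original rows,
# usually with some rows repeated and others left out.
print(len(sample), set(sample) <= set(rows))  # 10 True

# Three hypothetical sub-models, each a plain function from x to a class.
models = [lambda x: "a", lambda x: "b", lambda x: "a"]
print(bagged_predict(models, None))  # a
```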

### Summary:

<img src="image_github/boost_bag.png" width="500" height="500">

## 5. PySpark:

In [1]:
import os
import subprocess
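def module(*args):
    # Accept either module('load', 'x') or module(['load', 'x']).
    if isinstance(args[0], list):
        args = args[0]
    else:
        args = list(args)
    # modulecmd prints Python code that updates the environment; run it with exec.
    (output, error) = subprocess.Popen(['/usr/bin/modulecmd', 'python'] + args, stdout=subprocess.PIPE).communicate()
    exec(output)

# Load the JDK that Spark needs, then point PySpark at the conda environment's Python.
module('load', 'apps/java/jdk1.8.0_102/binary')
os.environ['PYSPARK_PYTHON'] = os.environ['HOME'] + '/.conda/envs/jupyter-spark/bin/python'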

In [2]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[2]") \
    .appName("COM6012 DT") \
    .getOrCreate()

sc = spark.sparkContext

In [3]:
rawdata = spark.read.csv('../Data/winequality-white.csv', sep=';', header='true')
rawdata.cache()

DataFrame[fixed acidity: string, volatile acidity: string, citric acid: string, residual sugar: string, chlorides: string, free sulfur dioxide: string, total sulfur dioxide: string, density: string, pH: string, sulphates: string, alcohol: string, quality: string]

In [5]:
rawdata.printSchema()

root
 |-- fixed acidity: string (nullable = true)
 |-- volatile acidity: string (nullable = true)
 |-- citric acid: string (nullable = true)
 |-- residual sugar: string (nullable = true)
 |-- chlorides: string (nullable = true)
 |-- free sulfur dioxide: string (nullable = true)
 |-- total sulfur dioxide: string (nullable = true)
 |-- density: string (nullable = true)
 |-- pH: string (nullable = true)
 |-- sulphates: string (nullable = true)
 |-- alcohol: string (nullable = true)
 |-- quality: string (nullable = true)



### Type conversion: string to double

In [7]:
# Cast every column from string to double.
from pyspark.sql.types import DoubleType

schemaNames = rawdata.schema.names
ncolumns = len(rawdata.columns)
for name in schemaNames:
    rawdata = rawdata.withColumn(name, rawdata[name].cast(DoubleType()))
# Rename the target column; schemaNames still holds the original column names.
rawdata = rawdata.withColumnRenamed('quality', 'labels')
rawdata.printSchema()

root
 |-- fixed acidity: double (nullable = true)
 |-- volatile acidity: double (nullable = true)
 |-- citric acid: double (nullable = true)
 |-- residual sugar: double (nullable = true)
 |-- chlorides: double (nullable = true)
 |-- free sulfur dioxide: double (nullable = true)
 |-- total sulfur dioxide: double (nullable = true)
 |-- density: double (nullable = true)
 |-- pH: double (nullable = true)
 |-- sulphates: double (nullable = true)
 |-- alcohol: double (nullable = true)
 |-- labels: double (nullable = true)



### Converting the data to a Vector:

* **RDD: map(lambda r: [Vectors.dense(r[:-1]), r[-1]]).toDF(['features','label'])**
* **DataFrame: VectorAssembler (a feature transformer that merges multiple columns into a vector column).**

In [8]:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols = schemaNames[0:ncolumns-1], outputCol = 'features') 
raw_plus_vector = assembler.transform(rawdata)

In [9]:
data = raw_plus_vector.select('features','labels')
data.show(5)

+--------------------+------+
|            features|labels|
+--------------------+------+
|[7.0,0.27,0.36,20...|   6.0|
|[6.3,0.3,0.34,1.6...|   6.0|
|[8.1,0.28,0.4,6.9...|   6.0|
|[7.2,0.23,0.32,8....|   6.0|
|[7.2,0.23,0.32,8....|   6.0|
+--------------------+------+
only showing top 5 rows



### Decision tree - regression

In [10]:
(trainingData, testData) = data.randomSplit([0.7, 0.3], 50)

from pyspark.ml.regression import DecisionTreeRegressor
dt = DecisionTreeRegressor(labelCol="labels", featuresCol="features", maxDepth=5)
model = dt.fit(trainingData)
predictions = model.transform(testData)

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator\
      (labelCol="labels", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE = %g " % rmse)

RMSE = 0.762375 


### Ensemble learning - bagging: random forest

In [11]:
from pyspark.ml.regression import RandomForestRegressor
rfr = RandomForestRegressor(labelCol="labels", featuresCol="features", maxDepth=5, numTrees=3)
model = rfr.fit(trainingData)
predictions = model.transform(testData)

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator\
      (labelCol="labels", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE = %g " % rmse)

RMSE = 0.725299 


### Ensemble learning - boosting: gradient boosting (gradient-boosted trees)

In [12]:
from pyspark.ml.regression import GBTRegressor
gbtr = GBTRegressor(labelCol="labels", featuresCol="features", maxDepth=5, maxIter=5, lossType='squared')
model = gbtr.fit(trainingData)
predictions = model.transform(testData)

from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator \
      (labelCol="labels", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
print("RMSE = %g " % rmse)

RMSE = 0.747853 
