# Decision Tree

## Summary

### Goal

+ Implement a decision tree for binary classification
+ Compare a low-level MapReduce-style implementation with MLlib

### Data

+ **StumbleUpon Evergreen Classification Challenge**
+ Dataset: https://www.kaggle.com/c/stumbleupon/data
+ Predict whether a page is ephemeral or evergreen

## Set up

### colab-environment

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://www-eu.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

import findspark
findspark.init()

In [9]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### local-environment

In [0]:
import os
os.environ['JAVA_HOME'] = r'C:\Program Files\Java\jdk1.8.0_201'  # raw string so the backslashes are kept literally

### import and set sc

In [0]:
import numpy as np # for preprocessing
import math

import pyspark
from pyspark import SparkConf, SparkContext

In [4]:
conf = SparkConf().set('spark.driver.host','127.0.0.1').setMaster("local").setAppName("DecisionTree").set("spark.default.parallelism", 4)
sc = SparkContext(conf=conf)
sc

In [0]:
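# Parameters
category_Numbers = 14 # 14 categories in total
spilt_rate = [9,1] # split the data 9:1 into training/test sets
Min_leaf_size = 250 # a node with at most this many samples becomes a leaf
N = 10 # sample_node passed to max_feature_gain when building the tree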

## Input

In [16]:
Input = sc.textFile("./data/train.tsv")
Input.count()

7396

In [17]:
Input.first()

'"url"\t"urlid"\t"boilerplate"\t"alchemy_category"\t"alchemy_category_score"\t"avglinksize"\t"commonlinkratio_1"\t"commonlinkratio_2"\t"commonlinkratio_3"\t"commonlinkratio_4"\t"compression_ratio"\t"embed_ratio"\t"framebased"\t"frameTagRatio"\t"hasDomainLink"\t"html_ratio"\t"image_ratio"\t"is_news"\t"lengthyLinkDomain"\t"linkwordscore"\t"news_front_page"\t"non_markup_alphanum_characters"\t"numberOfLinks"\t"numwords_in_url"\t"parametrizedLinkRatio"\t"spelling_errors_ratio"\t"label"'

## Preprocess

### Data cleaning

#### Remove the header row

In [0]:
title = Input.first()
Data = Input.filter(lambda x : x!= title)

In [19]:
Data.first()

'"http://www.bloomberg.com/news/2010-12-23/ibm-predicts-holographic-calls-air-breathing-batteries-by-2015.html"\t"4042"\t"{""title"":""IBM Sees Holographic Calls Air Breathing Batteries ibm sees holographic calls, air-breathing batteries"",""body"":""A sign stands outside the International Business Machines Corp IBM Almaden Research Center campus in San Jose California Photographer Tony Avelar Bloomberg Buildings stand at the International Business Machines Corp IBM Almaden Research Center campus in the Santa Teresa Hills of San Jose California Photographer Tony Avelar Bloomberg By 2015 your mobile phone will project a 3 D image of anyone who calls and your laptop will be powered by kinetic energy At least that s what International Business Machines Corp sees in its crystal ball The predictions are part of an annual tradition for the Armonk New York based company which surveys its 3 000 researchers to find five ideas expected to take root in the next five years IBM the world s largest 

#### Split the fields
+ The raw data is separated by `\t`, and each field is wrapped in `"`

In [20]:
lines = Data.map(lambda x : x.replace("\"","")).map(lambda x : x.split("\t"))
lines.first()[3:]

['business',
 '0.789131',
 '2.055555556',
 '0.676470588',
 '0.205882353',
 '0.047058824',
 '0.023529412',
 '0.443783175',
 '0',
 '0',
 '0.09077381',
 '0',
 '0.245831182',
 '0.003883495',
 '1',
 '1',
 '24',
 '0',
 '5424',
 '170',
 '8',
 '0.152941176',
 '0.079129575',
 '0']
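
Stripping every `"` also removes the doubled quotes inside the JSON-like `boilerplate` column, which does no harm here because that column is not used as a feature. As an alternative sketch (not what this notebook does), Python's standard `csv` module can parse a quoted, tab-separated line while keeping embedded quotes intact. `parse_tsv_line` is a hypothetical helper, and the sample line is made up for illustration.

```python
import csv

def parse_tsv_line(line):
    """Parse one tab-separated line whose fields are wrapped in double quotes."""
    # csv.reader accepts any iterable of lines; "" inside a quoted field is an escaped quote
    return next(csv.reader([line], delimiter="\t", quotechar='"'))

# Made-up sample line (not taken from the dataset)
sample = '"http://example.com"\t"1"\t"{""title"":""a b""}"\t"business"\t"0.5"'
fields = parse_tsv_line(sample)
print(fields[2])   # {"title":"a b"}
print(fields[3:])  # ['business', '0.5']
```

Assuming no field contains a raw newline, the same function could in principle be passed to `Data.map(...)` in place of the `replace`/`split` pair.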

### Feature extraction

#### Build the one-hot encoding table

In [0]:
category_with_index = lines.map(lambda x: x[3]).distinct().zipWithIndex()

In [22]:
category_Numbers_list = list(range(category_Numbers))
category_Numbers_array = np.array(category_Numbers_list).reshape(category_Numbers, -1)
category_Numbers_array

array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10],
       [11],
       [12],
       [13]])

In [23]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder()
enc.fit(category_Numbers_array)
encoder_table = enc.transform(category_Numbers_array).toarray()


In [24]:
category_Map = category_with_index.map(lambda x : (x[0],encoder_table[x[1]])).collectAsMap()
category_Map

{'?': array([0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]),
 'arts_entertainment': array([0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'business': array([1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'computer_internet': array([0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'culture_politics': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.]),
 'gaming': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]),
 'health': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]),
 'law_crime': array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]),
 'recreation': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.]),
 'religion': array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.]),
 'science_technology': array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.]),
 'sports': array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.]),
 'unknown': array([0., 0., 0., 0., 0.
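
Because the encoder is fitted on the integers 0 to 13, the resulting table is a 14×14 identity matrix: row i is the one-hot vector for index i. The following is a minimal sketch of the same lookup without scikit-learn. `one_hot_table` is a hypothetical helper, and the category-to-index assignment below is made up (the notebook derives it from `zipWithIndex`).

```python
def one_hot_table(n):
    """Return an n x n identity matrix as nested lists: row i is the one-hot vector for index i."""
    return [[1.0 if col == row else 0.0 for col in range(n)] for row in range(n)]

# Hypothetical category -> index assignment, for illustration only
category_index = {"business": 0, "arts_entertainment": 1, "computer_internet": 2}
table = one_hot_table(len(category_index))
category_map = {name: table[i] for name, i in category_index.items()}
print(category_map["arts_entertainment"])  # [0.0, 1.0, 0.0]
```

With NumPy available, `np.eye(category_Numbers)` produces the same table as an array.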

In [0]:
from pyspark.mllib.regression import LabeledPoint
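def extract_features(row):
    # One-hot vector for the alchemy_category column
    category_features = category_Map[row[3]]
    # Columns 4..24; note that row[4:-2] leaves out the last numeric column
    # (spelling_errors_ratio) as well as the label
    number_features = row[4:-2]
    # Missing values are marked with "?"; replace them with 0.0
    number_features = [0.0 if x=="?" else float(x) for x in number_features]
    
    features = np.concatenate((category_features,number_features))
    label = float(row[-1])
    
    return (label,features)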

In [26]:
labelRDD = lines.map(extract_features).map(lambda x: LabeledPoint(x[0],x[1]))
labelRDD.first()

LabeledPoint(0.0, [1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.789131,2.055555556,0.676470588,0.205882353,0.047058824,0.023529412,0.443783175,0.0,0.0,0.09077381,0.0,0.245831182,0.003883495,1.0,1.0,24.0,0.0,5424.0,170.0,8.0,0.152941176])

### Train/test split

In [27]:
(trainRDD,testRDD) = labelRDD.randomSplit(spilt_rate)
print("train: " + str(trainRDD.count()))
print("test:  " + str(testRDD.count()))

train: 6644
test:  751


### Persist

In [28]:
trainRDD.persist()
testRDD.persist()

PythonRDD[20] at RDD at PythonRDD.scala:53

## Train Model By MLlib

+ Use the built-in model as a baseline for comparison

In [0]:
from pyspark.mllib.tree import DecisionTree
model = DecisionTree.trainClassifier(
    data=trainRDD,numClasses=2,categoricalFeaturesInfo={},
    impurity="entropy", maxDepth=5, maxBins=5)

In [30]:
right_count = wrong_count = 0
for test_data in testRDD.take(testRDD.count()):
    ans = test_data.label
    gus = model.predict(test_data.features)
    if ans==gus:
        right_count += 1
    else :
        wrong_count += 1
    print(str(right_count) + ":" + str(wrong_count), end = "\r")



In [31]:
accuracy = right_count/(right_count+wrong_count)*100
print("Accuracy: %.1f%%" % accuracy)

Accuracy: 65.8%


## Train Model By MapReduce

+ category features: indices `0~13`, each 0 or 1 (one-hot)
+ numerical features: indices `14~34`, float
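
Since the tree below reports splits by feature index only, it can help to translate an index back to a column name. With 14 one-hot slots first, the numeric columns start at index 14. The sketch below is an assumption-labeled helper (`feature_name` is hypothetical), with the column names copied from the header row printed earlier. Which category each one-hot slot stands for depends on `category_Map`, so those slots are only numbered here.

```python
# Column names copied from the header row of train.tsv shown earlier
HEADER = [
    "url", "urlid", "boilerplate", "alchemy_category", "alchemy_category_score",
    "avglinksize", "commonlinkratio_1", "commonlinkratio_2", "commonlinkratio_3",
    "commonlinkratio_4", "compression_ratio", "embed_ratio", "framebased",
    "frameTagRatio", "hasDomainLink", "html_ratio", "image_ratio", "is_news",
    "lengthyLinkDomain", "linkwordscore", "news_front_page",
    "non_markup_alphanum_characters", "numberOfLinks", "numwords_in_url",
    "parametrizedLinkRatio", "spelling_errors_ratio", "label",
]
N_CATEGORIES = 14  # same value as category_Numbers

def feature_name(index):
    """Map a feature-vector index to a readable name (hypothetical helper)."""
    if index < N_CATEGORIES:
        # The category behind each slot is decided by category_Map at run time
        return "alchemy_category one-hot slot %d" % index
    # Numeric features come from row[4:-2], i.e. header columns 4..24
    return HEADER[4:-2][index - N_CATEGORIES]

print(feature_name(31))  # non_markup_alphanum_characters
print(feature_name(34))  # parametrizedLinkRatio
```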

### Computing the node-split parameters

#### Compute entropy

In [0]:
# Compute the entropy of a 2-state system
def entropy(state1,state2):
    if state1==0 or state2==0:
        return 0
    else:
        p1 = state1/(state1+state2)
        p2 = state2/(state1+state2)
#         p2 = 1-p1
        return -(p1*math.log2(p1)) -(p2*math.log2(p2))

In [0]:
def RDD_entropy(RDD,count="no_value"):
    count = RDD.count() if count=="no_value" else count
    state1 = RDD.filter(lambda x: x[0]).count()
    return entropy(state1,count-state1)

In [36]:
entropy(1,100)

0.08013604733127525
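
For reference, the function above is the binary Shannon entropy of a node, written in terms of the two class counts. The symbols $n_1$, $n_2$ stand for `state1`, `state2`; the worked numbers reproduce the `entropy(1,100)` call.

```latex
H(n_1, n_2) = -p_1 \log_2 p_1 - p_2 \log_2 p_2,
\qquad p_i = \frac{n_i}{n_1 + n_2},
\qquad H = 0 \text{ if } n_1 = 0 \text{ or } n_2 = 0.

% Worked example: n_1 = 1, n_2 = 100
p_1 = \tfrac{1}{101} \approx 0.0099, \quad p_2 = \tfrac{100}{101} \approx 0.9901
H \approx 0.0659 + 0.0142 \approx 0.0801
```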

#### Find the split with the largest information gain within one feature

In [0]:
def max_split_gain(RDD,sample_node = 0):
    # RDD (label,feature)
    split_points = RDD.values().distinct().collect()
    split_points.sort()
    R0_count = RDD.count()
    R0_entropy = RDD_entropy(RDD,R0_count)
    
    # sample the split points to speed up the search
    if sample_node<len(split_points) and sample_node>0:
        sample_rate = int(len(split_points)/sample_node)
        split_points = [split_points[i] for i in range(0,len(split_points),sample_rate)]
    
    # try every point in split_points
    # to get the max information gain
    gain_list = []
    for point in split_points:
        R1 = RDD.filter(lambda x : x[1]<point)
        R2 = RDD.filter(lambda x : x[1]>=point)
        R1_count = R1.count()
        R2_count = R0_count-R1_count
        
        gain = R0_entropy - (R1_count/R0_count)*RDD_entropy(R1,R1_count) - (R2_count/R0_count)*RDD_entropy(R2,R2_count)
        gain_list.append((gain,point))
    
    return max(gain_list) # (gain,split_point)
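
The search above launches several Spark jobs for every candidate split point, which makes it hard to check by hand. The quantity being maximized is the parent entropy minus the size-weighted entropies of the two children. Below is a minimal in-memory sketch of the same computation in plain Python, using the same `<` / `>=` rule for the left and right child. `best_split` is a hypothetical name and the data is made up.

```python
import math

def entropy(state1, state2):
    """Binary entropy (in bits) of a node holding state1 and state2 samples of each class."""
    if state1 == 0 or state2 == 0:
        return 0.0
    p1 = state1 / (state1 + state2)
    p2 = 1 - p1
    return -(p1 * math.log2(p1)) - (p2 * math.log2(p2))

def best_split(pairs):
    """pairs: list of (label, value) with label in {0, 1}. Returns (gain, split_point)."""
    total = len(pairs)
    ones = sum(label for label, _ in pairs)
    parent = entropy(ones, total - ones)
    results = []
    for point in sorted({value for _, value in pairs}):
        left = [label for label, value in pairs if value < point]    # left child
        right = [label for label, value in pairs if value >= point]  # right child
        gain = (parent
                - len(left) / total * entropy(sum(left), len(left) - sum(left))
                - len(right) / total * entropy(sum(right), len(right) - sum(right)))
        results.append((gain, point))
    return max(results)

# Made-up data: label 0 for small values, label 1 for large ones
toy = [(0, 1.0), (0, 2.0), (1, 3.0), (1, 4.0)]
print(best_split(toy))  # (1.0, 3.0)
```

Splitting at 3.0 separates the two classes perfectly, so the gain equals the full parent entropy of 1 bit.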
        

#### Find the largest information gain across all features

In [0]:
def max_feature_gain(RDD,sample_node=0):
    feature_types = len(RDD.first().features) # 35
    
    gain_list = []
    for feature_index in range(feature_types):
        RDD_one_feature = RDD.map(lambda x: (x.label,x.features[feature_index])) # (key,value)
        one_feature_max_gain = max_split_gain(RDD_one_feature,sample_node)
        print("Now in feature[%d],max gain is %.6f with split at %.3f " % (feature_index,one_feature_max_gain[0],one_feature_max_gain[1]),end="\r")
        gain_list.append((one_feature_max_gain,feature_index))
    
    max_gain = max(gain_list) # ((gain,split_point),feature_index)
    print("Best gain in feature[%d] with split at %.3f is : %.6f" % (max_gain[1],max_gain[0][1],max_gain[0][0]))
    return (max_gain[1],max_gain[0][1])

In [39]:
example_RDD = trainRDD.randomSplit([0.05,0.95])[0] # take about 5% of the data for the demo
max_feature_gain(RDD=example_RDD,sample_node=2)

Best gain in feature[12] with split at 1.000 is : 0.073192


(12, 1.0)
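
The `sample_node` argument thins the list of candidate split points before the search. The sketch below isolates that step in plain Python so its effect is easy to see. `thin_split_points` is a hypothetical name and the points are made up. Because the stride is rounded down, the number of candidates kept can be larger than `sample_node`.

```python
def thin_split_points(split_points, sample_node):
    """Keep every k-th sorted split point, with k = len(split_points) // sample_node."""
    if 0 < sample_node < len(split_points):
        stride = len(split_points) // sample_node
        return split_points[::stride]
    return split_points  # sampling disabled or too few points: keep them all

points = list(range(10))             # ten made-up candidate split points
print(thin_split_points(points, 3))  # [0, 3, 6, 9]
print(thin_split_points(points, 0))  # all ten points
```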

### Build the tree

In [0]:
class node:
    def __init__(self,RDD):
        # value
        self.RDD = RDD
        self.count = self.RDD.count()
        self.level = 0
        self.feature = None
        self.split_point = None
        self.predict_value = None
        self.RDD.persist()
        # tree
        self.left = None
        self.right = None
  
    def setLeft(self, left):
        self.left = left
        self.left.level = self.level + 1
        
    def setRight(self, right):
        self.right = right
        self.right.level = self.level + 1
    
    def get_count(self):
        self.count = self.RDD.count()
        return self.count
    
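    def get_predict(self):
        # Majority vote: predict 1 when at least half of the labels in this node are 1
        label_one_count = self.RDD.map(lambda x: x.label).filter(lambda x: x).count()
        # Comparing counts keeps the result in {0, 1} even for an all-ones node
        # and avoids dividing by zero for an empty node
        self.predict_value = int(self.count > 0 and 2*label_one_count >= self.count)
        return self.predict_value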
    
    def get_split(self):
        (self.feature,self.split_point) = max_feature_gain(self.RDD,sample_node=N)
        print("split at " + str((self.feature,self.split_point)))
        
    def is_leaf(self):
        return self.count <= Min_leaf_size
        
    def build(self):
        if self.is_leaf():
            return
        
        self.get_split()
        (feature_index,split_point_value) = (self.feature,self.split_point)
        print(self.count)
        
        R1 = self.RDD.filter(lambda x : x.features[feature_index]<split_point_value)
        self.setLeft(node(R1))
        print("build left at %s" % str((feature_index,split_point_value)))
        self.left.build()

        R2 = self.RDD.filter(lambda x : x.features[feature_index]>=split_point_value)
        self.setRight(node(R2))
        print("build right at %s" % str((feature_index,split_point_value)))
        self.right.build()
    
    def level_order_print(self):
        # Note: despite the name, this prints the tree depth-first (pre-order), indented by level
        if self.is_leaf():
            self.get_predict()
            print("\t"*self.level + str(self.predict_value))
            return
        else :
            print("\t"*self.level + str((self.feature,self.split_point)))
        
        print("\t"*self.level + "left")
        self.left.level_order_print()
        print("\t"*self.level + "right")
        self.right.level_order_print()
    
    def predict(self,features):
        if self.is_leaf():
            self.get_predict()
#             print(self.predict_value)
            return self.predict_value
        else:
            if features[self.feature] < self.split_point:
                return self.left.predict(features)
            else :
                return self.right.predict(features)
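
Once the tree is built, prediction only needs the `(feature, split_point)` pair of each internal node and the majority label of each leaf, not the RDDs. The following is a minimal sketch of the same descent on a small in-memory structure. `Leaf`, `Split`, and `predict` are hypothetical names, and the thresholds and leaf values of the toy tree are made up.

```python
class Leaf:
    def __init__(self, value):
        self.value = value  # predicted class, 0 or 1

class Split:
    def __init__(self, feature, split_point, left, right):
        self.feature = feature          # index into the feature vector
        self.split_point = split_point  # threshold
        self.left = left                # subtree for features[feature] < split_point
        self.right = right              # subtree for features[feature] >= split_point

def predict(tree, features):
    """Walk from the root to a leaf, applying the same < / >= rule as node.predict."""
    while isinstance(tree, Split):
        tree = tree.left if features[tree.feature] < tree.split_point else tree.right
    return tree.value

# Hypothetical two-level tree; thresholds and leaf values are made up
toy_tree = Split(0, 1562.0, Split(1, 1.0, Leaf(0), Leaf(1)), Leaf(1))
print(predict(toy_tree, [1000.0, 0.0]))  # 0
print(predict(toy_tree, [1000.0, 1.0]))  # 1
print(predict(toy_tree, [2000.0, 0.0]))  # 1
```

Keeping leaf predictions as plain values would also avoid running a Spark job for every test sample at prediction time.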

In [0]:
root = node(trainRDD)

In [42]:
root.build()

Best gain in feature[31] with split at 1562.000 is : 0.035237
split at (31, 1562.0)
6644
build left at (31, 1562.0)
Best gain in feature[1] with split at 1.000 is : 0.014714
split at (1, 1.0)
1657
build left at (1, 1.0)
Best gain in feature[13] with split at 1.000 is : 0.010817
split at (13, 1.0)
1395
build left at (13, 1.0)
Best gain in feature[23] with split at 0.036 is : 0.008670
split at (23, 0.036101083)
1314
build left at (23, 0.036101083)
Best gain in feature[16] with split at 0.780 is : 0.051035
split at (16, 0.780487805)
274
build left at (16, 0.780487805)
Best gain in feature[12] with split at 1.000 is : 0.037525
split at (12, 1.0)
246
build left at (12, 1.0)
Best gain in feature[31] with split at 1508.000 is : 0.021168
split at (31, 1508.0)
206
build left at (31, 1508.0)
build right at (31, 1508.0)
build right at (12, 1.0)
build right at (16, 0.780487805)
build right at (23, 0.036101083)
Best gain in feature[31] with split at 1013.000 is : 0.007487
split at (31, 1013.0)
1040

In [43]:
root.level_order_print()

(31, 1562.0)
left
	(1, 1.0)
	left
		(13, 1.0)
		left
			(23, 0.036101083)
			left
				(16, 0.780487805)
				left
					(12, 1.0)
					left
						(31, 1508.0)
						left
							0
						right
							0
					right
						1
				right
					1
			right
				(31, 1013.0)
				left
					(15, 1.919117647)
					left
						(19, 0.016393443)
						left
							(25, 0.257515128)
							left
								0
							right
								0
						right
							1
					right
						(16, 0.202380952)
						left
							0
						right
							(10, 1.0)
							left
								(31, 610.0)
								left
									0
								right
									0
							right
								0
				right
					(20, 0.52057842)
					left
						0
					right
						0
		right
			1
	right
		(34, 0.285714286)
		left
			0
		right
			0
right
	(12, 1.0)
	left
		(0, 1.0)
		left
			(7, 1.0)
			left
				(29, 24.0)
				left
					(19, 0.106796117)
					left
						(33, 6.0)
						left
							(21, 0.000614)
							left
								(16, 0.384615385)
								left
									(24, 1.0)
									left

## Test Model

In [45]:
right_count = wrong_count = 0
for test_data in testRDD.take(testRDD.count()):
    ans = test_data.label
    gus = root.predict(test_data.features)
    if ans==gus:
        right_count += 1
    else :
        wrong_count += 1
    print(str(right_count) + ":" + str(wrong_count), end = "\r")
print(str(right_count) + ":" + str(wrong_count))

491:260


In [46]:
accuracy = right_count/(right_count+wrong_count)*100
print("Accuracy: %.1f%%" % accuracy)

Accuracy: 65.4%
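
The reported accuracy follows directly from the `491:260` tally printed by the test loop. A short sketch of the arithmetic, including the gap to the MLlib accuracy of 65.8 reported earlier in this notebook (`accuracy_percent` is a hypothetical helper):

```python
def accuracy_percent(right_count, wrong_count):
    """Share of correct predictions, in percent."""
    return right_count / (right_count + wrong_count) * 100

mapreduce_acc = accuracy_percent(491, 260)  # tally printed by the test loop
print("%.1f" % mapreduce_acc)               # 65.4
# 65.8 is the MLlib accuracy reported earlier in this notebook
print("%.1f" % (65.8 - mapreduce_acc))      # 0.4
```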


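## Conclusion

+ Building the decision tree
    + Built recursively as a binary tree
    + Each node picks the feature and split_point with the largest information gain
    + A minimum leaf size stops the recursion
+ Comparing predictions
    + Accuracy is close to the MLlib result (65.4% vs. 65.8%, a 0.4-point gap)
+ Future improvements
    + Pruning
    + Better feature selection
    + Random forest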