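# Predicting whether it will rain on a given day
Features:
* Daily minimum temperature
* Daily minimum air pressure
* Humidity
* Wind speed

Assumptions:
Rain depends only on these four conditions, and all of them must hold at the same time.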

Temperature < 25 (°C)

Pressure < 1010 (hPa)

Relative humidity > 60 (%)

Wind speed > 5 (km/h)
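
The four conditions combine with a logical AND. A minimal sketch of that rule as a plain function is shown here; the name `is_rain` and its keyword arguments are illustrative and not part of the notebook, and the default thresholds are the ones used by the labeling cell further down (`min_press < 1010`, `wind_speed > 5`).

```python
def is_rain(min_temp, min_press, humidity, wind_speed,
            max_temp=25, max_press=1010, min_humidity=60, min_wind=5):
    """Return True only when all four conditions hold at the same time."""
    return (min_temp < max_temp
            and min_press < max_press
            and humidity > min_humidity
            and wind_speed > min_wind)


# Two rows taken from the sample table printed later in the notebook.
print(is_rain(21, 1009, 76, 17))  # all four conditions hold
print(is_rain(26, 986, 40, 8))    # too warm and too dry
```

Keeping the thresholds as keyword arguments makes it easy to try other cut-offs without touching the rule itself.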

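Value ranges (integers, matching the `randint` calls below, whose upper bound is exclusive):

Temperature: 15~29

Pressure: 980~1029

Humidity: 30~89

Wind speed: 0~24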

That gives on the order of $10^6$ possible combinations, too many to enumerate,

so 1000 randomly generated samples are used instead.
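
The size of the input space can be checked with a short calculation. This sketch assumes the half-open `[low, high)` bounds that are passed to `numpy.random.randint` in the data-generation cell; the dictionary name `bounds` is illustrative.

```python
from math import prod

# Half-open [low, high) bounds, as passed to numpy.random.randint in the
# data-generation cell of this notebook.
bounds = {
    "min_temp": (15, 30),
    "min_press": (980, 1030),
    "humidity": (30, 90),
    "wind_speed": (0, 25),
}

# Number of distinct integer values each feature can take.
n_values = {name: high - low for name, (low, high) in bounds.items()}
total = prod(n_values.values())

print(n_values)
print(total)  # 15 * 50 * 60 * 25 = 1125000
```

With 1000 samples, only a small fraction of the possible combinations is ever seen.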

In [4]:
import pandas as pd
import numpy
from sklearn.tree import DecisionTreeClassifier
from io import StringIO  # sklearn.externals.six is no longer shipped with scikit-learn
from sklearn.tree import export_graphviz
import pydotplus 

In [77]:
temp_df = pd.DataFrame(numpy.random.randint(15,30,size=(1000, 1)), columns=['min_temp'])
press_df = pd.DataFrame(numpy.random.randint(980,1030,size=(1000, 1)), columns=['min_press'])
humidity_df = pd.DataFrame(numpy.random.randint(30,90,size=(1000, 1)), columns=['humidity'])
wind_df = pd.DataFrame(numpy.random.randint(0,25,size=(1000, 1)), columns=['wind_speed'])
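
Because the data are random, every run of the cell above produces a different table. If reproducible results are wanted, one option is a seeded generator. This is a sketch under assumptions: the seed value `0` and the name `df_seeded` are arbitrary, and `numpy.random.default_rng` is swapped in for the `randint` calls used in the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # the seed value 0 is an arbitrary choice
n_rows = 1000

# Generator.integers, like randint, excludes the upper bound by default.
df_seeded = pd.DataFrame({
    "min_temp": rng.integers(15, 30, size=n_rows),
    "min_press": rng.integers(980, 1030, size=n_rows),
    "humidity": rng.integers(30, 90, size=n_rows),
    "wind_speed": rng.integers(0, 25, size=n_rows),
})

# Smallest and largest value drawn for each feature.
print(df_seeded.describe().loc[["min", "max"]])
```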

In [78]:
# Combine the four single-column frames side by side into one feature table
df = pd.concat([temp_df, press_df, humidity_df, wind_df], axis=1)
df.head(10)

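|   | min_temp | min_press | humidity | wind_speed |
|---|---|---|---|---|
| 0 | 26 | 986 | 40 | 8 |
| 1 | 21 | 1009 | 76 | 17 |
| 2 | 23 | 994 | 30 | 18 |
| 3 | 20 | 1005 | 30 | 2 |
| 4 | 22 | 995 | 51 | 20 |
| 5 | 18 | 982 | 83 | 16 |
| 6 | 18 | 1016 | 71 | 16 |
| 7 | 21 | 987 | 41 | 5 |
| 8 | 21 | 980 | 57 | 21 |
| 9 | 15 | 1018 | 85 | 2 |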


In [79]:
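# Label a row as rain (1.0) only when all four conditions hold, otherwise 0.0.
# The boolean mask is evaluated for every row of df, up to and including index 999.
rain_mask = ((df['min_temp'] < 25) & (df['min_press'] < 1010)
             & (df['humidity'] > 60) & (df['wind_speed'] > 5))
ret = pd.DataFrame({'rain': rain_mask.astype(float)})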

In [80]:
ret.head(10)

|   | rain |
|---|---|
| 0 | 0.0 |
| 1 | 1.0 |
| 2 | 0.0 |
| 3 | 0.0 |
| 4 | 0.0 |
| 5 | 1.0 |
| 6 | 0.0 |
| 7 | 0.0 |
| 8 | 0.0 |
| 9 | 0.0 |


In [81]:
df_train = df[:700]
df_train.head()

|   | min_temp | min_press | humidity | wind_speed |
|---|---|---|---|---|
| 0 | 26 | 986 | 40 | 8 |
| 1 | 21 | 1009 | 76 | 17 |
| 2 | 23 | 994 | 30 | 18 |
| 3 | 20 | 1005 | 30 | 2 |
| 4 | 22 | 995 | 51 | 20 |


In [82]:
df_test = df[700:]
df_test.head()

|   | min_temp | min_press | humidity | wind_speed |
|---|---|---|---|---|
| 700 | 29 | 996 | 66 | 16 |
| 701 | 28 | 1013 | 53 | 2 |
| 702 | 24 | 985 | 68 | 2 |
| 703 | 25 | 983 | 57 | 12 |
| 704 | 25 | 1029 | 38 | 9 |


In [83]:
y = ret['rain'].values

y_train = y[:700]
y_test = y[700:]
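
The 700/300 split above is done by slicing. The same split can also be written with scikit-learn's `train_test_split`; the sketch below uses stand-in data (`X`, `labels`) so that it runs on its own, and those names are not from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data so the snippet runs on its own.
X = pd.DataFrame({"a": np.arange(1000), "b": np.arange(1000) * 2})
labels = np.zeros(1000)

# shuffle=False keeps the row order, so this matches X[:700] / X[700:].
X_train, X_test, y_tr, y_te = train_test_split(
    X, labels, train_size=700, shuffle=False)

print(X_train.shape, X_test.shape)
```

Since the rows here are already random, shuffling is not needed; with ordered real-world data the default `shuffle=True` is usually the safer choice.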

In [84]:
dtree=DecisionTreeClassifier(max_depth=6)
dtree.fit(df_train,y_train)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(df_train),
                class_names=['sun','rain'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_pdf("tree.pdf")

True

In [85]:
dtree.feature_importances_

array([0.3086275 , 0.25423792, 0.19084072, 0.24629386])
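
The importances are returned in the same order as the training columns, so they are easier to read once paired with the feature names. A small sketch, with the values copied from the output above and the variable names chosen for illustration:

```python
import pandas as pd

# Column order of the training data.
feature_names = ["min_temp", "min_press", "humidity", "wind_speed"]
# Values copied from the feature_importances_ output above.
importances = [0.3086275, 0.25423792, 0.19084072, 0.24629386]

# Pair each importance with its feature and sort from most to least important.
ranked = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(ranked)
```

All four features carry comparable weight, which fits a rule that requires every condition to hold.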

In [86]:
y_predict = dtree.predict(df_test)

y_predict

array([0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0.,
       0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
       0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 0., 0., 1., 0.,
       0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
       0., 0., 0., 1., 0.

In [87]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

0.9966666666666667

The prediction is over $99\%$ accurate, presumably because the labeling rule above is fully deterministic.
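
To put that accuracy in context, it helps to know how often the rule fires at all. The sketch below computes the expected share of rainy rows, assuming independent uniform draws with the `randint` bounds and the thresholds from the code cells above; the names `features` and `p_rain` are illustrative. The result is about 0.147, so a classifier that always predicts "sun" would already score roughly 0.853.

```python
from fractions import Fraction

# (low, high) are the half-open randint bounds; the lambda is the rule's test.
features = {
    "min_temp": (15, 30, lambda v: v < 25),
    "min_press": (980, 1030, lambda v: v < 1010),
    "humidity": (30, 90, lambda v: v > 60),
    "wind_speed": (0, 25, lambda v: v > 5),
}

p_rain = Fraction(1)
for low, high, passes in features.values():
    # Count the integer values in [low, high) that satisfy this condition.
    hits = sum(1 for v in range(low, high) if passes(v))
    p_rain *= Fraction(hits, high - low)

print(float(p_rain))      # expected share of rainy rows
print(float(1 - p_rain))  # accuracy of always predicting "sun"
```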

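A further variation is to leave max_depth unset (that is, pass no parameters and let the tree grow on its own).

Next, change the rule so that rain occurs with $80\%$ probability when the conditions are met: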

In [88]:
# Same four conditions as before, plus label noise: when they all hold,
# it rains with probability 0.8 (numpy.random.rand(...) < 0.8).
# Then repeat the steps above.
rain_mask = ((df['min_temp'] < 25) & (df['min_press'] < 1010)
             & (df['humidity'] > 60) & (df['wind_speed'] > 5))
keep = numpy.random.rand(len(df)) < 0.8
ret_ran = pd.DataFrame({'rain': (rain_mask & keep).astype(float)})


In [89]:
df_ran_train = df[:700]
df_ran_test = df[700:]

y_ran = ret_ran['rain'].values

y_ran_train = y_ran[:700]
y_ran_test = y_ran[700:]

dtree=DecisionTreeClassifier(max_depth = 6)
dtree.fit(df_ran_train,y_ran_train)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(df_train),
                class_names=['sun','rain'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
graph.write_pdf("tree_ran.pdf")

True

In [90]:
dtree.feature_importances_

array([0.28325567, 0.25528527, 0.24778824, 0.21367083])

In [91]:
y_ran_predict = dtree.predict(df_ran_test)
accuracy_score(y_ran_test, y_ran_predict)

0.9566666666666667

---

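The accuracy is still quite high,

and the resulting decision tree is much the same as the one trained without noise,

possibly with a few extra leaf nodes.

If max_depth is not set (that is, no parameters are passed),

overfitting can become a problem, and accuracy may drop below $90\%$.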
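
The effect of max_depth can be checked by repeating the experiment on freshly generated noisy data. This is a sketch under assumptions: the helper `make_noisy_data`, the seed, and the choice of 20 repetitions are illustrative, and the rule and the $80\%$ rain probability follow the description above.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier


def make_noisy_data(rng, n_rows=1000, p_rain=0.8):
    """Synthetic data with the notebook's rule plus label noise."""
    X = pd.DataFrame({
        "min_temp": rng.integers(15, 30, size=n_rows),
        "min_press": rng.integers(980, 1030, size=n_rows),
        "humidity": rng.integers(30, 90, size=n_rows),
        "wind_speed": rng.integers(0, 25, size=n_rows),
    })
    rule = ((X["min_temp"] < 25) & (X["min_press"] < 1010)
            & (X["humidity"] > 60) & (X["wind_speed"] > 5)).to_numpy()
    # When the rule holds, keep the rain label with probability p_rain.
    y = (rule & (rng.random(n_rows) < p_rain)).astype(int)
    return X, y


rng = np.random.default_rng(0)  # arbitrary seed
scores = {None: [], 6: []}      # None means an unconstrained tree
for _ in range(20):
    X, y = make_noisy_data(rng)
    for depth in scores:
        tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
        tree.fit(X[:700], y[:700])
        scores[depth].append(tree.score(X[700:], y[700:]))

for depth, vals in scores.items():
    print(f"max_depth={depth}: mean accuracy {np.mean(vals):.3f}")
```

Comparing the two means over several datasets is more reliable than comparing a single run, since each dataset is random.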

Next, try a random forest to see what changes.

In [92]:
from sklearn import ensemble, metrics
forest = ensemble.RandomForestClassifier(n_estimators = 100)
forest_fit = forest.fit(df_ran_train, y_ran_train)

test_y_predicted = forest.predict(df_ran_test)

accuracy = metrics.accuracy_score(y_ran_test, test_y_predicted)
print(accuracy)

0.9333333333333333


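With Random Forest, leaving the parameters at their defaults seems to give higher accuracy;

setting max_depth gives lower accuracy than leaving it unset.

Compared with the decision tree, the random forest's accuracy is lower (based on a comparison over 100 regenerated datasets).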
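
The repeated comparison can be sketched as a loop that regenerates the data, trains both models, and records their test accuracy. The helper `run_once`, the seed, and the reduced number of runs are assumptions made to keep the example quick; the models use the same settings as in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def run_once(rng, n_rows=1000, p_rain=0.8):
    """Generate one noisy dataset and return the test accuracy of both models."""
    X = pd.DataFrame({
        "min_temp": rng.integers(15, 30, size=n_rows),
        "min_press": rng.integers(980, 1030, size=n_rows),
        "humidity": rng.integers(30, 90, size=n_rows),
        "wind_speed": rng.integers(0, 25, size=n_rows),
    })
    rule = ((X["min_temp"] < 25) & (X["min_press"] < 1010)
            & (X["humidity"] > 60) & (X["wind_speed"] > 5)).to_numpy()
    y = (rule & (rng.random(n_rows) < p_rain)).astype(int)
    models = {
        "tree": DecisionTreeClassifier(max_depth=6),
        "forest": RandomForestClassifier(n_estimators=100),
    }
    # First 700 rows for training, remaining 300 for testing.
    return {name: model.fit(X[:700], y[:700]).score(X[700:], y[700:])
            for name, model in models.items()}


rng = np.random.default_rng(0)  # arbitrary seed
n_runs = 10  # the text compares 100 regenerated datasets; fewer runs keep this quick
results = pd.DataFrame([run_once(rng) for _ in range(n_runs)])

print(results.mean())
wins = int((results["tree"] >= results["forest"]).sum())
print(f"tree >= forest in {wins} of {n_runs} runs")
```

Raising `n_runs` to 100 mirrors the comparison described in the text at the cost of a longer run time.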