Ensemble

An ensemble classifier combines the predictions of several classifiers to reach its decision.
There are two common approaches:

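1) Bagging:
Build several independent classifiers in parallel from the same training data (each one is typically fit on a bootstrap sample of it), then take a majority vote to make the final classification decision. A random forest classifier, for example, builds many decision trees from the same training data, and each tree considers a randomly chosen subset of the features while it is being built.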
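The bagging idea above can be sketched in pure Python. This is only an illustration under stated assumptions, not how RandomForestClassifier is implemented: the helper names (fit_stump, fit_bagging, predict_bagging) are hypothetical, the base model is a one-split decision stump, and the toy data is made up.

```python
import random
from collections import Counter

def fit_stump(rows, labels):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            # Each side predicts its own majority label; an empty right side reuses the left label.
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda r: l_lab if r[f] <= t else r_lab

def fit_bagging(rows, labels, n_models=25, seed=0):
    """Fit n_models stumps, each on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models

def predict_bagging(models, row):
    """Majority vote over the individual models' predictions."""
    votes = Counter(m(row) for m in models)
    return votes.most_common(1)[0][0]

# Made-up toy data: [age, pclass] -> survived
rows = [[5, 1], [8, 2], [30, 3], [40, 3], [25, 1], [60, 2], [3, 3], [50, 1]]
labels = [1, 1, 0, 0, 1, 0, 1, 0]

models = fit_bagging(rows, labels)
preds = [predict_bagging(models, r) for r in rows]
print(preds)
```

An odd number of models avoids tied votes in a two-class problem. A random forest additionally restricts each tree to a random subset of features, which this sketch leaves out.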

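2) Boosting:
Build several classifiers in a fixed order, so that the models depend on one another. In general, each model added to the ensemble must improve the combined performance of the ensemble built so far. Repeating this step keeps improving the updated ensemble, with the goal of combining many weak classifiers into one model with stronger classification ability. Gradient boosted trees are an example: each new decision tree is built to reduce, as far as possible, the fitting error of the whole ensemble on the training set.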
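The sequential dependence described above can be sketched in pure Python. This is a simplified illustration, not the algorithm inside GradientBoostingClassifier: it assumes a squared-error loss on 0/1 labels, a single numeric feature, and one-split regression stumps. The helper names (fit_regression_stump, fit_boosting) and the data are made up for the example.

```python
def fit_regression_stump(xs, residuals):
    """Find the threshold on one feature that minimises squared error on the residuals."""
    best = None
    for t in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        l_mean = sum(left) / len(left)
        r_mean = sum(right) / len(right) if right else l_mean
        sse = sum((r - l_mean) ** 2 for r in left) + sum((r - r_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, l_mean, r_mean)
    _, t, l_mean, r_mean = best
    return lambda x: l_mean if x <= t else r_mean

def fit_boosting(xs, ys, n_rounds=20, learning_rate=0.5):
    """Add stumps one at a time; each stump is fit to the current ensemble's residuals."""
    base = sum(ys) / len(ys)          # start from the mean label
    preds = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
        # Record the training mean squared error after this round.
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history

# Made-up toy data: one feature, 0/1 labels that no single threshold separates
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

predict, history = fit_boosting(xs, ys)
print("training MSE after first round:", history[0])
print("training MSE after last round: ", history[-1])
```

Each round can only lower (or keep) the training error, which is why boosting is usually paired with a learning rate and a limit on the number of rounds to control overfitting.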

In [21]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report

In [2]:
# Load the data
df = pd.read_csv('C:/Users/USER/Desktop/Github/Python Project/train.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
# Preprocessing: select the features.
# .copy() makes x an independent DataFrame, so the assignment in the next cell does not raise SettingWithCopyWarning.
x = df[['Pclass', 'Age', 'Sex']].copy()

In [12]:
# Fill missing ages with the mean age
x['Age'] = x['Age'].fillna(x['Age'].mean())


In [8]:
y = df['Survived']

In [13]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 3 columns):
Pclass    891 non-null int64
Age       891 non-null float64
Sex       891 non-null object
dtypes: float64(1), int64(1), object(1)
memory usage: 21.0+ KB


In [15]:
y

0      0
1      1
2      1
3      1
4      0
      ..
886    0
887    1
888    0
889    1
890    0
Name: Survived, Length: 891, dtype: int64

In [17]:
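# Split the data into a training set (75%) and a test set (25%)
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=33)

# Vectorize the features: DictVectorizer one-hot encodes string columns such as Sex
# and passes numeric columns through unchanged (it does not standardize the data)
vec = DictVectorizer(sparse=False)
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))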
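To make the column layout of the arrays in the next cells easier to read, here is a hand-written sketch of the kind of encoding DictVectorizer applies. It is an approximation under stated assumptions, not the library's code: one_hot_records is a hypothetical helper, the two records are illustrative, and the column order assumes feature names are sorted alphabetically.

```python
def one_hot_records(records):
    """Encode dicts: numbers pass through, strings become 'name=value' indicator columns."""
    names = set()
    for rec in records:
        for key, value in rec.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)  # alphabetical column order
    matrix = []
    for rec in records:
        row = [0.0] * len(names)
        for key, value in rec.items():
            if isinstance(value, str):
                row[names.index(f"{key}={value}")] = 1.0
            else:
                row[names.index(key)] = float(value)
        matrix.append(row)
    return names, matrix

# Illustrative records in the same shape as x.to_dict(orient='records')
records = [
    {"Pclass": 1, "Age": 47.0, "Sex": "male"},
    {"Pclass": 3, "Age": 16.0, "Sex": "female"},
]
names, matrix = one_hot_records(records)
print(names)   # ['Age', 'Pclass', 'Sex=female', 'Sex=male']
print(matrix)  # [[47.0, 1.0, 0.0, 1.0], [16.0, 3.0, 1.0, 0.0]]
```

Read this way, each row of X_train and X_test is [Age, Pclass, Sex=female, Sex=male].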

In [19]:
X_train  

#numpy.ndarray

array([[47.        ,  1.        ,  0.        ,  1.        ],
       [40.        ,  3.        ,  0.        ,  1.        ],
       [29.69911765,  3.        ,  0.        ,  1.        ],
       ...,
       [25.        ,  2.        ,  0.        ,  1.        ],
       [21.        ,  3.        ,  0.        ,  1.        ],
       [35.        ,  2.        ,  0.        ,  1.        ]])

In [20]:
X_test 

#numpy.ndarray

array([[ 2.        ,  1.        ,  1.        ,  0.        ],
       [16.        ,  3.        ,  1.        ,  0.        ],
       [51.        ,  3.        ,  0.        ,  1.        ],
       [29.        ,  3.        ,  1.        ,  0.        ],
       [34.        ,  2.        ,  0.        ,  1.        ],
       [45.        ,  1.        ,  0.        ,  1.        ],
       [27.        ,  1.        ,  0.        ,  1.        ],
       [24.        ,  2.        ,  1.        ,  0.        ],
       [14.        ,  3.        ,  0.        ,  1.        ],
       [18.        ,  2.        ,  1.        ,  0.        ],
       [40.        ,  1.        ,  1.        ,  0.        ],
       [36.        ,  1.        ,  0.        ,  1.        ],
       [18.        ,  3.        ,  0.        ,  1.        ],
       [45.        ,  3.        ,  1.        ,  0.        ],
       [29.69911765,  3.        ,  1.        ,  0.        ],
       [17.        ,  3.        ,  0.        ,  1.        ],
       [19.        ,  3.

Training:



In [23]:
# Train a single decision tree and predict on the test set:
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_pred = dtc.predict(X_test)

# Train a random forest classifier (ensemble) and predict on the test set:
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)

# Train gradient boosted trees (ensemble) and predict on the test set:
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)

# Train extreme gradient boosted trees (XGBoost, ensemble) and predict on the test set:
xgbc = XGBClassifier()
xgbc.fit(X_train, y_train)
xgbc_y_pred = xgbc.predict(X_test)



Evaluating:
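
The reports in this section list precision, recall and F1 per class. As a reminder of what those columns mean, here is a small pure-Python sketch. precision_recall_f1 is a hypothetical helper and the labels are made up. scikit-learn's classification_report takes its arguments as (y_true, y_pred), and the last lines show that swapping the two exchanges precision and recall.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for one class from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels: 4 actual positives, 5 predicted positives, 3 of them correct
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)   # precision 3/5, recall 3/4

# Passing the predictions as if they were the true labels swaps the two columns
swapped = precision_recall_f1(y_pred, y_true)
print(swapped)                 # precision 3/4, recall 3/5, same F1
```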



In [31]:

# Note: classification_report expects (y_true, y_pred). The calls below pass the predictions first,
# so the precision and recall columns are exchanged and 'support' counts predicted labels.

# Print the test-set accuracy of the single decision tree, plus precision, recall and F1.
print('The accuracy of decision tree is', dtc.score(X_test, y_test))
print(classification_report(dtc_y_pred, y_test))
print('-------------------------------------------------------')

# Print the test-set accuracy of the random forest classifier, plus precision, recall and F1.
print('The accuracy of random forest classifier is', rfc.score(X_test, y_test))
print(classification_report(rfc_y_pred, y_test))
print('-------------------------------------------------------')

# Print the test-set accuracy of the gradient boosted trees, plus precision, recall and F1.
print('The accuracy of gradient tree boosting is', gbc.score(X_test, y_test))
print(classification_report(gbc_y_pred, y_test))
print('-------------------------------------------------------')

# Print the test-set accuracy of the extreme gradient boosted trees, plus precision, recall and F1.
print('The accuracy of extreme gradient tree boosting is', xgbc.score(X_test, y_test))
print(classification_report(xgbc_y_pred, y_test))

The accuracy of decision tree is 0.8340807174887892
              precision    recall  f1-score   support

           0       0.90      0.84      0.87       143
           1       0.74      0.82      0.78        80

    accuracy                           0.83       223
   macro avg       0.82      0.83      0.82       223
weighted avg       0.84      0.83      0.84       223

-------------------------------------------------------
The accuracy of random forest classifier is 0.820627802690583
              precision    recall  f1-score   support

           0       0.87      0.84      0.85       140
           1       0.74      0.80      0.77        83

    accuracy                           0.82       223
   macro avg       0.81      0.82      0.81       223
weighted avg       0.82      0.82      0.82       223

-------------------------------------------------------
The accuracy of gradient tree boosting is 0.8430493273542601
              precision    recall  f1-score   support

    