## Titanic: Machine Learning from Disaster

**Import libraries**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os

**Load the data**

In [2]:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

**Check each column for missing values**

In [3]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [4]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
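
`info()` reports non-null counts, so the number of missing values has to be read off by subtraction. They can also be tabulated directly. A minimal sketch on a small made-up frame (the values are illustrative and are not taken from `train.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [None, 'C85', None, None],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isnull().sum()
missing_share = toy.isnull().mean()

summary = pd.DataFrame({'missing': missing, 'share': missing_share})
print(summary.sort_values('missing', ascending=False))
```

The same two calls should work unchanged on `train` and `test`.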


**Manually select the features that are likely to be useful for prediction:**

In [5]:
selected_features=['Pclass', 'Sex','Age','Embarked','SibSp','Parch','Fare']

In [6]:
X_train = train[selected_features]
X_test = test[selected_features]

In [7]:
y_train=train['Survived']

**Fill in the missing values of the Embarked feature.**

In [8]:
X_train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [9]:
X_test['Embarked'].value_counts()

S    270
C    102
Q     46
Name: Embarked, dtype: int64

**We fill them with the most frequent value.**

In [10]:
# Assign the filled frame back instead of filling a slice in place,
# which avoids pandas' SettingWithCopyWarning.
X_train = X_train.fillna({'Embarked': 'S'})
X_test = X_test.fillna({'Embarked': 'S'})
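
The fill value `'S'` above is hardcoded from the `value_counts()` output. It could also be computed. A small sketch on a made-up column (the data and the names `embarked` and `most_frequent` are illustrative assumptions):

```python
import pandas as pd

# Toy column standing in for X_train['Embarked']; the values are made up.
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S', None], name='Embarked')

# mode() returns a Series (there may be ties), so take the first entry.
most_frequent = embarked.mode()[0]

# fillna returns a new Series; assigning it back avoids in-place edits on a slice.
embarked_filled = embarked.fillna(most_frequent)
print(most_frequent, embarked_filled.isnull().sum())
```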


**Fill the missing Age values (and the single missing Fare value in the test set) with the column mean.**

In [11]:
# Compute the means on the numeric columns only and assign the result back,
# so no SettingWithCopyWarning is raised.
X_train = X_train.fillna({'Age': X_train['Age'].mean()})
X_test = X_test.fillna(X_test[['Age', 'Fare']].mean())


In [12]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
Embarked    891 non-null object
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 48.8+ KB


In [13]:
X_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 7 columns):
Pclass      418 non-null int64
Sex         418 non-null object
Age         418 non-null float64
Embarked    418 non-null object
SibSp       418 non-null int64
Parch       418 non-null int64
Fare        418 non-null float64
dtypes: float64(2), int64(3), object(2)
memory usage: 22.9+ KB
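
The notebook fills the test set with the test set's own means. A common alternative is to learn the fill values from the training rows only and reuse them on the test rows. The following is a sketch of that alternative using scikit-learn's `SimpleImputer` on made-up frames. It is not what the cells above do, and the names `toy_train` and `toy_test` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for X_train / X_test; the values are made up.
toy_train = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Fare': [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({'Age': [np.nan, 50.0], 'Fare': [np.nan, 8.0]})

# Learn the column means from the training rows only ...
imputer = SimpleImputer(strategy='mean')
train_filled = imputer.fit_transform(toy_train)

# ... and reuse those same means for the test rows.
test_filled = imputer.transform(toy_test)

print(imputer.statistics_)  # per-column means learned from toy_train
print(test_filled)
```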


**Vectorize the features with DictVectorizer**

In [14]:
from sklearn.feature_extraction import DictVectorizer
dict_vec = DictVectorizer(sparse=False)
X_train = dict_vec.fit_transform(X_train.to_dict(orient='records'))
dict_vec.feature_names_

['Age',
 'Embarked=C',
 'Embarked=Q',
 'Embarked=S',
 'Fare',
 'Parch',
 'Pclass',
 'Sex=female',
 'Sex=male',
 'SibSp']

In [15]:
X_test = dict_vec.transform(X_test.to_dict(orient='records'))
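
To make the encoding concrete, here is a small sketch of what `DictVectorizer` does with mixed records: numeric values pass through, and each string value becomes its own 0/1 column. The two records are made up for illustration:

```python
from sklearn.feature_extraction import DictVectorizer

# Two made-up passenger records; string values become one-hot columns.
records = [
    {'Sex': 'male', 'Age': 22.0},
    {'Sex': 'female', 'Age': 38.0},
]

vec = DictVectorizer(sparse=False)
encoded = vec.fit_transform(records)

print(vec.feature_names_)
print(encoded)

# A category not seen during fit is silently dropped at transform time.
unseen = vec.transform([{'Sex': 'unknown', 'Age': 30.0}])
print(unseen)
```

This is why the test set is passed through `transform` rather than `fit_transform`: both matrices then share the same columns in the same order.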

**I will train the following machine learning models:**
- Logistic Regression
- KNN or k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
- Perceptron
- Artificial neural network
- RVM or Relevance Vector Machine

**Import LogisticRegression from sklearn.linear_model**

Use LogisticRegression with its default settings.

In [25]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

In [27]:
logreg.fit(X_train, y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, y_train) * 100, 2)
acc_log

80.359999999999999

**Import SVC from sklearn.svm**

Use SVC with its default settings.

In [28]:
from sklearn.svm import SVC
svc = SVC()

In [30]:
svc.fit(X_train, y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, y_train) * 100, 2)
acc_svc

89.0

**Import KNeighborsClassifier from sklearn.neighbors**

Use KNeighborsClassifier with n_neighbors=3.

In [31]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 3)

In [32]:
knn.fit(X_train, y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, y_train) * 100, 2)
acc_knn

83.609999999999999

**Import GaussianNB from sklearn.naive_bayes**

Use GaussianNB with its default settings.

In [34]:
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()

In [35]:
gaussian.fit(X_train, y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, y_train) * 100, 2)
acc_gaussian

79.120000000000005

**Import Perceptron from sklearn.linear_model**

Use Perceptron with its default settings.

In [36]:
from sklearn.linear_model import Perceptron
perceptron = Perceptron()

In [37]:
perceptron.fit(X_train, y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, y_train) * 100, 2)
acc_perceptron



67.900000000000006

**Import LinearSVC from sklearn.svm**

Use LinearSVC with its default settings.

In [38]:
from sklearn.svm import LinearSVC
linear_svc = LinearSVC()

In [39]:
linear_svc.fit(X_train, y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, y_train) * 100, 2)
acc_linear_svc

68.689999999999998

**Import SGDClassifier from sklearn.linear_model**

Use SGDClassifier with its default settings.

In [40]:
from sklearn.linear_model import SGDClassifier
sgd = SGDClassifier()

In [42]:
sgd.fit(X_train, y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, y_train) * 100, 2)
acc_sgd



63.75

**Import DecisionTreeClassifier from sklearn.tree**

Use DecisionTreeClassifier with its default settings.

In [43]:
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()

In [44]:
decision_tree.fit(X_train, y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, y_train) * 100, 2)
acc_decision_tree

98.200000000000003

**Import RandomForestClassifier from sklearn.ensemble**

Use RandomForestClassifier with n_estimators=100.

In [48]:
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier(n_estimators=100)

In [49]:
random_forest.fit(X_train, y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, y_train) * 100, 2)
acc_random_forest

98.200000000000003

**Import XGBClassifier from the popular xgboost package to handle the classification task.**

XGBClassifier is also initialized with its default configuration.

In [60]:
from xgboost import XGBClassifier
xgbc = XGBClassifier()

In [61]:
xgbc.fit(X_train, y_train)
Y_pred = xgbc.predict(X_test)
acc_xgbc = round(xgbc.score(X_train, y_train) * 100, 2)
acc_xgbc

87.209999999999994

**Print the training-set accuracy of every model as a table, sorted from highest to lowest.**

In [76]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree', 'XGBoost'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_linear_svc, acc_decision_tree, acc_xgbc]})
models.sort_values(by='Score', ascending=False)

| | Model | Score |
|---|---|---|
| 3 | Random Forest | 98.2 |
| 8 | Decision Tree | 98.2 |
| 0 | Support Vector Machines | 89.0 |
| 9 | XGBoost | 87.21 |
| 1 | KNN | 83.61 |
| 2 | Logistic Regression | 80.36 |
| 4 | Naive Bayes | 79.12 |
| 7 | Linear SVC | 68.69 |
| 5 | Perceptron | 67.9 |
| 6 | Stochastic Gradient Descent | 63.75 |
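
The scores in this table are computed on the same rows the models were fitted on, so flexible models such as trees can look better than they are. The sketch below illustrates the effect on synthetic data. The dataset, the split, and all variable names are assumptions for illustration and have nothing to do with the Titanic files:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (assumption for illustration).
X, y = make_classification(n_samples=600, n_features=10, n_informative=4,
                           flip_y=0.2, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

# Accuracy on the rows the tree was fitted on vs. rows it has never seen.
fit_acc = tree.score(X_fit, y_fit)
val_acc = tree.score(X_val, y_val)
print(f'fitted rows: {fit_acc:.3f}, held-out rows: {val_acc:.3f}')
```

An unpruned tree can memorize its fitting rows, so the first number says little about how the model will do on unseen passengers.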


**Evaluate each model with 5-fold cross-validation on the training set and report its mean classification accuracy.**

In [None]:
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import cross_val_score

In [53]:
cross_val_score(logreg, X_train, y_train, cv=5).mean()

0.79128522828142689

In [54]:
cross_val_score(svc, X_train, y_train, cv=5).mean()

0.71398254549013807

In [55]:
cross_val_score(knn, X_train, y_train, cv=5).mean()

0.70932382481371825

In [56]:
cross_val_score(gaussian, X_train, y_train, cv=5).mean()

0.78903147649095495

In [57]:
cross_val_score(perceptron, X_train, y_train, cv=5).mean()



0.60247820136769192

In [58]:
cross_val_score(linear_svc, X_train, y_train, cv=5).mean()

0.64332642855648314

In [59]:
cross_val_score(sgd, X_train, y_train, cv=5).mean()



0.69482464455648596

In [51]:
cross_val_score(random_forest, X_train, y_train, cv=5).mean()

0.81712820862001279

In [52]:
cross_val_score(xgbc, X_train, y_train, cv=5).mean()

0.81824559798311003
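
The cells above call `cross_val_score` once per model. The same evaluation can be written as a loop that collects the results in one table. The sketch below uses synthetic data and a hypothetical `candidates` dictionary. The scaled SVC pipeline is an addition that the notebook does not contain; distance- and margin-based models are often sensitive to feature scale, so it may be worth comparing:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for X_train / y_train (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Hypothetical candidate set; the scaled SVC is an addition, not part of the notebook.
candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'SVC': SVC(),
    'SVC + StandardScaler': make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold accuracy per model; the pipeline refits the scaler inside each fold.
cv_scores = {name: cross_val_score(model, X, y, cv=5).mean()
             for name, model in candidates.items()}

cv_table = pd.DataFrame({'Model': list(cv_scores),
                         'CV accuracy': list(cv_scores.values())})
print(cv_table.sort_values(by='CV accuracy', ascending=False))
```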

**Make predictions with the RandomForestClassifier (n_estimators=100).**

In [62]:
random_forest.fit(X_train, y_train)
random_forest_y_predict = random_forest.predict(X_test)
random_forest_submission = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':random_forest_y_predict})

**Save the RandomForestClassifier's predictions for the test data to random_forest_submission.csv.**

In [63]:
random_forest_submission.to_csv('./random_forest_submission.csv', index=False)

**Make predictions with the default-configured XGBClassifier.**

In [67]:
xgbc.fit(X_train, y_train)
xgbc_y_predict = xgbc.predict(X_test)

**Save the default-configured XGBClassifier's predictions for the test data to xgbc_submission.csv.**

In [68]:
xgbc_submission = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':xgbc_y_predict})
xgbc_submission.to_csv('./xgbc_submission.csv', index=False)
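
Before uploading, it can be useful to sanity-check a submission frame. The sketch below round-trips a tiny made-up frame through CSV in memory and checks the two-column layout the cells above produce. The ids and labels are illustrative, not real predictions:

```python
import io
import pandas as pd

# Made-up ids and labels standing in for test['PassengerId'] and the model output.
submission = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Basic sanity checks: exact columns, no missing values, binary labels, unique ids.
assert list(reloaded.columns) == ['PassengerId', 'Survived']
assert reloaded.notnull().all().all()
assert set(reloaded['Survived']).issubset({0, 1})
assert reloaded['PassengerId'].is_unique
print(reloaded.shape)
```

For the real files, the row count should additionally match the 418 entries reported by `test.info()`.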

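**Use a parallel grid search to look for a better hyperparameter combination, in the hope of further improving the XGBClassifier's predictive performance.**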

In [69]:
# sklearn.grid_search was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import GridSearchCV
params = {'max_depth': [2, 3, 4, 5, 6, 7], 
          'n_estimators': [100, 200, 300, 400, 500, 600]}



In [70]:
xgbc_best = XGBClassifier()
gs = GridSearchCV(xgbc_best, params, n_jobs=-1, cv=5, verbose=1)
gs.fit(X_train, y_train)

Fitting 5 folds for each of 36 candidates, totalling 180 fits


[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed:   11.9s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'max_depth': [2, 3, 4, 5, 6, 7], 'n_estimators': [100, 200, 300, 400, 500, 600]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=1)

**Inspect the tuned XGBClassifier's hyperparameters and its cross-validated accuracy.**

In [71]:
print(gs.best_score_)
print(gs.best_params_)

0.835016835016835
{'max_depth': 5, 'n_estimators': 100}
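
Beyond `best_score_` and `best_params_`, a fitted `GridSearchCV` keeps the score of every parameter combination in `cv_results_`. The sketch below shows how that table can be inspected. To stay self-contained it uses synthetic data, a much smaller grid, and scikit-learn's `GradientBoostingClassifier` as a stand-in for `XGBClassifier`. All of these are assumptions for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a scikit-learn booster stand in for X_train and XGBClassifier.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
small_grid = {'max_depth': [2, 3], 'n_estimators': [20, 40]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0), small_grid, cv=5)
search.fit(X, y)

# cv_results_ holds one row per parameter combination.
results = pd.DataFrame(search.cv_results_)
columns = ['param_max_depth', 'param_n_estimators', 'mean_test_score', 'std_test_score']
print(results[columns].sort_values(by='mean_test_score', ascending=False))
print(search.best_params_, round(search.best_score_, 3))
```

Comparing `mean_test_score` with `std_test_score` gives a rough sense of whether the best combination is clearly ahead or within the noise of the folds.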


**Predict on the test data with the tuned XGBClassifier and save the results to xgbc_best_submission.csv.**

In [72]:
xgbc_best_y_predict = gs.predict(X_test)
xgbc_best_submission = pd.DataFrame({'PassengerId':test['PassengerId'], 'Survived':xgbc_best_y_predict})
xgbc_best_submission.to_csv('./xgbc_best_submission.csv', index=False)