## Competition Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

In [71]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

%matplotlib inline

In [72]:
train_data = pd.read_csv("train.csv")

In [73]:
# Separate the label column: Survived
y_train = train_data["Survived"].copy()
x_train = train_data.drop(columns=["Survived"])

In [74]:
# Data preprocessing: numeric pipeline
from sklearn.base import BaseEstimator
class NumberPreprocesser(BaseEstimator):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # Numeric columns: Pclass, Age, SibSp, Parch, Fare
        # (SibSp and Parch enter only through the derived features below)
        x_num = X[["Pclass", "Age", "Fare", "Sex"]].copy()
        # Remap Pclass so that a higher class gets a larger number
        x_num["Pclass"] = x_num["Pclass"].replace({1: 3, 3: 1})
        # Encode the Sex category as a number
        x_num["Sex"] = x_num["Sex"].map({"male": 0, "female": 1})
        # Add derived features
        x_num["Parch_b"] = X["Parch"] > 0
        x_num["SibSp_b"] = X["SibSp"] > 0
        # x_train_num["single_dog"] = (X["Parch"] == 0) & (X["SibSp"] == 0)
        x_num["Has_family"] = (X["Parch"] > 0) | (X["SibSp"] > 0)
        return x_num

In [75]:
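from sklearn.pipeline import Pipeline
# Note: Imputer was removed in scikit-learn 0.22; newer versions provide
# sklearn.impute.SimpleImputer(strategy="median") instead.
from sklearn.preprocessing import Imputer
num_pipeline = Pipeline([
    ("NumberPreprecess", NumberPreprocesser()),
    ("Imputer", Imputer(strategy="median")),
    # Consider adding a scaler here
])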
# x_train_num = num_pipeline.fit_transform(x_train)
# num_attribs = ["Pclass", "Age", "Fare", "Sex", "Parch_b", "SibSp_b", "Has_family"]
# x_num_df = pd.DataFrame(x_train_num, columns=num_attribs)
# x_num_df.info()
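
The comment in the pipeline above suggests adding a scaler. A minimal sketch of what that might look like, assuming scikit-learn 0.20 or later (where `SimpleImputer` is available) and a small made-up array standing in for the numeric features; `toy` and `scaled_pipeline` are illustrative names, not part of the notebook:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for two numeric features (Age, Fare) with one missing Age.
toy = np.array([[22.0, 7.25], [np.nan, 71.28], [26.0, 7.93], [35.0, 53.10]])

scaled_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # fill NaN with the column median
    ("scaler", StandardScaler()),                   # zero mean, unit variance per column
])

toy_scaled = scaled_pipeline.fit_transform(toy)
print(toy_scaled.mean(axis=0))  # per-column means, approximately 0
print(toy_scaled.std(axis=0))   # per-column standard deviations, approximately 1
```

Scaling tends to matter for gradient-based linear models such as `SGDClassifier`, because features on very different scales (for example Fare versus a 0/1 flag) can dominate the updates.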

In [76]:
# Categorical data preprocessing
class CagetoryPreprosser(BaseEstimator):
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        # process Embarked category
        embarked = X[["Embarked"]].copy()
        embarked = pd.get_dummies(embarked)
        # process Cabin: use the frame passed in (X), not the global x_train
        cabin = X["Cabin"].copy()
        # Collapse each cabin value to its deck letter, e.g. "C85" -> "C"
        for deck in "ABCDEFGT":
            cabin = cabin.replace(to_replace=deck + ".*", value=deck, regex=True)
        cabin = pd.get_dummies(cabin)
        return np.hstack((embarked.values, cabin.values))
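
One thing to watch with `pd.get_dummies` inside `transform`: it only emits columns for categories present in the frame it receives, so a test set with no passenger on, say, deck T would produce fewer columns than the training set. A hedged sketch of one way to pin the column set with `reindex`; `DECKS` and `deck_dummies` are hypothetical names, and the deck is taken from the first character here rather than through the regex replacements used above:

```python
import pandas as pd

DECKS = list("ABCDEFGT")  # deck letters handled by the transformer above

def deck_dummies(cabin):
    """One-hot encode the deck letter with a fixed set and order of columns."""
    deck = cabin.str[0]                 # "C85" -> "C"; missing values stay missing
    dummies = pd.get_dummies(deck)      # columns only for the decks that occur
    # Add absent decks as all-zero columns and enforce a stable order.
    return dummies.reindex(columns=DECKS, fill_value=0).astype(int)

train_cabin = pd.Series(["C85", None, "E46", "T"])
test_cabin = pd.Series(["B45", None])

print(deck_dummies(train_cabin).shape)
print(deck_dummies(test_cabin).shape)
```

Both calls return eight columns regardless of which decks occur, so the arrays line up between training and prediction.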

In [77]:
cat_pipeline = Pipeline([
    ("CagetoryPreprosser", CagetoryPreprosser()),
])

# x_train_cat = cat_pipeline.fit_transform(x_train)
# cat_attribs = ["Embarked_C","Embarked_Q","Embarked_S","Cabin_A","Cabin_B","Cabin_C","Cabin_D","Cabin_E","Cabin_F","Cabin_G","Cabin_T"]
# x_cat_df = pd.DataFrame(x_train_cat, columns = cat_attribs)
# x_cat_df.info()

In [81]:
# Data preprocessing: full pipeline
from sklearn.pipeline import FeatureUnion
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
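x_train_prepared = full_pipeline.fit_transform(x_train)
# Column names for the numeric and categorical feature blocks
# (these were only defined in commented-out lines of the earlier cells)
num_attribs = ["Pclass", "Age", "Fare", "Sex", "Parch_b", "SibSp_b", "Has_family"]
cat_attribs = ["Embarked_C","Embarked_Q","Embarked_S","Cabin_A","Cabin_B","Cabin_C","Cabin_D","Cabin_E","Cabin_F","Cabin_G","Cabin_T"]
full_attribs = num_attribs + cat_attribs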
x_prepared_df = pd.DataFrame(x_train_prepared, columns = full_attribs)
x_prepared_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 18 columns):
Pclass        891 non-null float64
Age           891 non-null float64
Fare          891 non-null float64
Sex           891 non-null float64
Parch_b       891 non-null float64
SibSp_b       891 non-null float64
Has_family    891 non-null float64
Embarked_C    891 non-null float64
Embarked_Q    891 non-null float64
Embarked_S    891 non-null float64
Cabin_A       891 non-null float64
Cabin_B       891 non-null float64
Cabin_C       891 non-null float64
Cabin_D       891 non-null float64
Cabin_E       891 non-null float64
Cabin_F       891 non-null float64
Cabin_G       891 non-null float64
Cabin_T       891 non-null float64
dtypes: float64(18)
memory usage: 125.4 KB


In [82]:
x_prepared_df.head()

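| | Pclass | Age | Fare | Sex | Parch_b | SibSp_b | Has_family | Embarked_C | Embarked_Q | Embarked_S | Cabin_A | Cabin_B | Cabin_C | Cabin_D | Cabin_E | Cabin_F | Cabin_G | Cabin_T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 22.0 | 7.25 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 3.0 | 38.0 | 71.2833 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1.0 | 26.0 | 7.925 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 3.0 | 35.0 | 53.1 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1.0 | 35.0 | 8.05 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |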


In [84]:
# SGDClassifier
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier()
sgd_clf.fit(x_train_prepared, y_train)



SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False)

In [88]:
# Accuracy on the training set itself (not on held-out data)
train_predict = sgd_clf.predict(x_train_prepared)
num_correct = sum(train_predict == y_train)
print("Prediction: ", num_correct / len(y_train))

Prediction:  0.6161616161616161
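
An accuracy near 0.62 is worth comparing with a trivial baseline: a classifier that always predicts the most frequent class already scores the share of that class. A small sketch on made-up labels (not the competition data); `majority_baseline_accuracy` is an illustrative helper, not part of the notebook:

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent label in y."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / len(y)

# Made-up labels for illustration: five zeros and three ones.
toy_labels = [0, 0, 1, 0, 1, 0, 1, 0]
print(majority_baseline_accuracy(toy_labels))  # 5 / 8 = 0.625
```

In the notebook this would be called as `majority_baseline_accuracy(y_train)`. If the result is close to the accuracy printed above, the model has probably learned little beyond the class balance.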


In [None]:
# Cross Validation
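
The cell above is left empty. One way cross-validation could be sketched is with `cross_val_score`; the example below uses synthetic data from `make_classification` in place of `x_train_prepared` and `y_train`, and the choices `cv=5`, `max_iter=1000`, `tol=1e-3` and `random_state=42` are arbitrary assumptions rather than settings from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (x_train_prepared, y_train); not the Titanic data.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

clf = SGDClassifier(max_iter=1000, tol=1e-3, random_state=42)

# 5-fold cross-validation: each score is the accuracy on one held-out fold.
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Unlike the accuracy computed on the training set, each fold score here is measured on rows the model did not see during fitting, so the mean is a less optimistic estimate.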