### Kaggle Competition

Address: [Titanic](https://www.kaggle.com/c/titanic)

Baseline = data preprocessing + logistic regression

Reference: https://blog.csdn.net/han_xiaoyang/article/details/49797143

In [82]:
import re
import numpy as np
import pandas as pd
import seaborn as sns

data_train = pd.read_csv('../Dataset/Titanic/train.csv')
data_test = pd.read_csv('../Dataset/Titanic/test.csv')
data_train.info()
data_train.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


In [72]:
# overview of the data.

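#### Data Preprocessing: 1.1 Missing data
Before processing the data, note that Cabin and Age both have missing values: Age is present for 714 of 891 rows and Cabin for only 204.
In general:
1. A feature with a very high missing rate can be dropped, since including it may only add noise.
2. If a moderate share is missing and the feature is categorical, NaN can be added as a new category.
3. If a moderate share is missing and the feature is continuous, it can be discretized with a fixed step (for Age here, for example, one bin per 3 years), after which missing values can be given their own bin.
4. If only a few values are missing, we can try to fit them from the other features.


**Solution**

* In this example, Cabin is replaced by Yes/No to indicate whether a value is present.

* The missing Age values are fitted with the RandomForest from scikit-learn.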
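The strategies listed above can be sketched on a small made-up frame. This is only an illustration under stated assumptions: the `toy` data, the `'Unknown'` and `'Missing'` labels, and the derived column names are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic files
toy = pd.DataFrame({
    'Cabin': ['C85', np.nan, 'E46', np.nan],
    'Embarked': ['S', np.nan, 'C', 'S'],
    'Age': [22.0, np.nan, 4.0, 35.0],
})

# Presence flag: Yes/No depending on whether Cabin has a value
toy['Has_cabin'] = np.where(toy['Cabin'].notnull(), 'Yes', 'No')

# Strategy 2: treat NaN in a categorical feature as its own category
toy['Embarked_filled'] = toy['Embarked'].fillna('Unknown')

# Strategy 3: discretize a continuous feature in 3-year steps,
# and give missing values their own bin
age_step = 3
lower_edge = toy['Age'] // age_step * age_step  # NaN stays NaN
toy['Age_bin'] = lower_edge.map(
    lambda a: 'Missing' if pd.isna(a) else f'{int(a)}-{int(a) + age_step}'
)

print(toy[['Has_cabin', 'Embarked_filled', 'Age_bin']])
```

Strategy 4 (fitting the missing values from other features) is the one this notebook applies to Age.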

In [73]:
from sklearn.ensemble import RandomForestRegressor
def set_missing_ages(df):
    # use only the numerical features
    age_data = df[['Age','Fare','Parch','SibSp','Pclass']]
    print(type(age_data)) # pandas.core.frame.DataFrame
    
    # split passengers into known-Age and unknown-Age groups
    known_age = age_data[age_data.Age.notnull()].values
    unknown_age = age_data[age_data.Age.isnull()].values
    
    y = known_age[:, 0]
    X = known_age[:, 1:]
    rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
    rfr.fit(X, y)
    
    # use the fitted model to predict the unknown ages
    predictedAges = rfr.predict(unknown_age[:, 1:])
    
    # fill the missing Age values with the predictions
    df.loc[ (df.Age.isnull()), 'Age' ] = predictedAges 
    
    return df, rfr

def set_Cabin_type(df):
    df.loc[ (df.Cabin.notnull()), 'Cabin' ] = "Yes"
    df.loc[ (df.Cabin.isnull()), 'Cabin' ] = "No"
    return df

data_train, rfr = set_missing_ages(data_train)
data_train = set_Cabin_type(data_train)
data_train.head(10)

<class 'pandas.core.frame.DataFrame'>


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | No | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | Yes | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | No | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | Yes | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | No | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | 23.838953 | 0 | 0 | 330877 | 8.4583 | No | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | Yes | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | No | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | No | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | No | C |


#### Data Preprocessing: 1.2 Converting to numeric types
With the missing Age values now filled in, the next step is to predict with logistic regression, which requires numeric features.

In [77]:
# pandas.get_dummies converts a categorical variable into dummy/indicator variables.
# e.g. Cabin Yes/No -> indicator columns Cabin_Yes and Cabin_No
Cabin_dummies = pd.get_dummies(data_train['Cabin'], prefix = 'Cabin')
Embarked_dummies = pd.get_dummies(data_train['Embarked'], prefix = 'Embarked')
Sex_dummies = pd.get_dummies(data_train['Sex'], prefix = 'Sex')
Pclass_dummies = pd.get_dummies(data_train['Pclass'], prefix='Pclass')

df = pd.concat([data_train,Cabin_dummies, Embarked_dummies, Sex_dummies, Pclass_dummies], axis=1)
df.drop(['Cabin','Embarked', 'Sex', 'Pclass', 'Name', 'Ticket'], axis=1, inplace=True)
# df
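To make the effect of `pd.get_dummies` concrete, here is a minimal sketch on a made-up three-row frame (the `toy` data is hypothetical). It passes `dtype=int` so the indicators print as 0/1; recent pandas versions otherwise return booleans by default.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns
toy = pd.DataFrame({'Sex': ['male', 'female', 'female'], 'Pclass': [3, 1, 3]})

# One indicator column per distinct value, named <prefix>_<value>
sex_dummies = pd.get_dummies(toy['Sex'], prefix='Sex', dtype=int)
pclass_dummies = pd.get_dummies(toy['Pclass'], prefix='Pclass', dtype=int)

# Append the indicators and drop the original categorical columns
encoded = pd.concat([toy, sex_dummies, pclass_dummies], axis=1)
encoded = encoded.drop(['Sex', 'Pclass'], axis=1)

print(encoded)
```

Only values that actually occur get a column, so `Pclass_2` is absent in this toy frame.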



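#### Data Preprocessing: 1.3 Feature scaling

Note that Age and Fare take much larger values than the other features, which tends to slow the convergence of logistic regression. It is common to rescale such features. The StandardScaler used here standardizes each one to zero mean and unit variance, so the values are centered near 0 but are not strictly limited to [-1, 1].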

In [78]:
import sklearn.preprocessing as preprocessing

scaler = preprocessing.StandardScaler()
# fit() returns the scaler itself, so age_scale_param and fare_scale_param are both
# references to `scaler`; fit_transform learns a column's mean/std and standardizes it.
age_scale_param = scaler.fit(df['Age'].values.reshape(-1,1))
df['Age_scaled'] = scaler.fit_transform(df['Age'].values.reshape(-1,1))
fare_scale_param = scaler.fit(df['Fare'].values.reshape(-1,1))
df['Fare_scaled'] = scaler.fit_transform(df['Fare'].values.reshape(-1,1))
df.head(5)

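| | PassengerId | Survived | Age | SibSp | Parch | Fare | Cabin_No | Cabin_Yes | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male | Pclass_1 | Pclass_2 | Pclass_3 | Age_scaled | Fare_scaled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 22.0 | 1 | 0 | 7.25 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | -0.56138 | -0.502445 |
| 1 | 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0.613171 | 0.786845 |
| 2 | 3 | 1 | 26.0 | 0 | 0 | 7.925 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | -0.267742 | -0.488854 |
| 3 | 4 | 1 | 35.0 | 1 | 0 | 53.1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0.392942 | 0.42073 |
| 4 | 5 | 0 | 35.0 | 0 | 0 | 8.05 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0.392942 | -0.486337 |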
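A common convention with `StandardScaler` is to fit it on the training column only and then reuse the fitted object to transform the test column, so both sets share the same mean and standard deviation. The sketch below shows that pattern on made-up numbers; `train_age`, `test_age`, and `age_scaler` are hypothetical names, not variables from this notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages; StandardScaler expects 2-D input of shape (n_samples, n_features)
train_age = np.array([[20.0], [30.0], [40.0]])
test_age = np.array([[30.0], [50.0]])

# Learn mean and standard deviation from the training column only
age_scaler = StandardScaler().fit(train_age)

# Apply the same training statistics to both sets
train_scaled = age_scaler.transform(train_age)
test_scaled = age_scaler.transform(test_age)

print(age_scaler.mean_, age_scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())
```

The scaled training column has zero mean and unit variance, while a test value far from the training mean can land well outside [-1, 1].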


#### Data Preprocessing: 1.4 For test data

In [84]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In [85]:
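# the test set has one missing Fare; fill it with 0
data_test.loc[ (data_test.Fare.isnull()), 'Fare' ] = 0

# apply the same feature transformations to the test data as to the training data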
tmp_df = data_test[['Age','Fare', 'Parch', 'SibSp', 'Pclass']]
null_age = tmp_df[data_test.Age.isnull()].values

# Use randomforest model to predict Age for test data
X = null_age[:, 1:]
predictedAges = rfr.predict(X)
data_test.loc[ (data_test.Age.isnull()), 'Age' ] = predictedAges

data_test = set_Cabin_type(data_test)
# One-hot encode the categorical features
dummies_Cabin = pd.get_dummies(data_test['Cabin'], prefix= 'Cabin')
dummies_Embarked = pd.get_dummies(data_test['Embarked'], prefix= 'Embarked')
dummies_Sex = pd.get_dummies(data_test['Sex'], prefix= 'Sex')
dummies_Pclass = pd.get_dummies(data_test['Pclass'], prefix= 'Pclass')


df_test = pd.concat([data_test, dummies_Cabin, dummies_Embarked, dummies_Sex, dummies_Pclass], axis=1)
df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
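# Note: fit_transform re-fits the scaler on the test column (the second argument is
# accepted as y and ignored), so these values use the test set's own mean/std
# rather than the training-set statistics.
df_test['Age_scaled'] = scaler.fit_transform(df_test['Age'].values.reshape(-1,1), age_scale_param)
df_test['Fare_scaled'] = scaler.fit_transform(df_test['Fare'].values.reshape(-1,1), fare_scale_param)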
df_test.head()

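| | PassengerId | Age | SibSp | Parch | Fare | Cabin_No | Cabin_Yes | Embarked_C | Embarked_Q | Embarked_S | Sex_female | Sex_male | Pclass_1 | Pclass_2 | Pclass_3 | Age_scaled | Fare_scaled |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 34.5 | 0 | 0 | 7.8292 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0.307521 | -0.496637 |
| 1 | 893 | 47.0 | 1 | 0 | 7.0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1.256241 | -0.511497 |
| 2 | 894 | 62.0 | 0 | 0 | 9.6875 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 2.394706 | -0.463335 |
| 3 | 895 | 27.0 | 0 | 0 | 8.6625 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | -0.261711 | -0.481704 |
| 4 | 896 | 22.0 | 1 | 1 | 12.2875 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | -0.641199 | -0.41674 |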
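Because `pd.get_dummies` only creates columns for values that actually occur, the encoded test frame could in principle end up with fewer columns than the training frame. Here every expected column is present, but a defensive sketch of aligning the two might look like this; `train_cols` and `toy_test` are hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical list of dummy columns seen in the training data
train_cols = ['Embarked_C', 'Embarked_Q', 'Embarked_S']

# Hypothetical test data in which the value 'Q' never occurs
toy_test = pd.DataFrame({'Embarked': ['S', 'C']})
test_dummies = pd.get_dummies(toy_test['Embarked'], prefix='Embarked', dtype=int)

# Reorder to the training layout and fill any missing indicator with 0
aligned = test_dummies.reindex(columns=train_cols, fill_value=0)

print(aligned)
```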


#### Model Building 2.1


In [80]:
from sklearn import linear_model
# use regex to choose the attribute that we need
train_df = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
# from dataframe to numpy array
train_np = train_df.values

y = train_np[:,0]
X = train_np[:,1:]

# get a model
clf = linear_model.LogisticRegression(solver='liblinear',C=1.0, penalty='l1', tol=1e-6)
clf.fit(X, y)

clf

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=1e-06,
          verbose=0, warm_start=False)
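One way to get a feel for a fitted model is to pair each coefficient with its feature name. The sketch below does this on synthetic data with the same solver settings as above; the `signal` and `noise` features, the labels, and the names `toy_clf` and `coefs` are made up for illustration and say nothing about the Titanic coefficients.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends on 'signal' only, 'noise' is irrelevant
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({'signal': rng.randn(200), 'noise': rng.randn(200)})
y_toy = (X_toy['signal'] > 0).astype(int)

# Same solver and penalty as the baseline model
toy_clf = LogisticRegression(solver='liblinear', C=1.0, penalty='l1', tol=1e-6)
toy_clf.fit(X_toy, y_toy)

# coef_ has shape (1, n_features) for a binary problem
coefs = pd.DataFrame({'feature': X_toy.columns, 'coef': toy_clf.coef_[0]})
print(coefs)
```

A positive coefficient pushes the prediction toward class 1, and the L1 penalty tends to shrink uninformative coefficients toward zero.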

#### Get Baseline result 3.1

In [86]:
test = df_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
predictions = clf.predict(test)
result = pd.DataFrame({'PassengerId':data_test['PassengerId'].values, 'Survived':predictions.astype(np.int32)})
result.to_csv("../Dataset/Titanic/logistic_regression_predictions.csv", index=False)


This baseline gives the first result: the submission scores 0.76555 on Kaggle.
