# Investigating the Titanic Dataset

## 1 Inspecting the dataset structure

In [1]:
import pandas as pd

file_path = './titanic-data.csv'
original_df = pd.read_csv(file_path)
original_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


#### Dataset structure as described on the [Kaggle website](https://www.kaggle.com/c/titanic/data):

**Data Dictionary**

|Variable|Definition|Key|
|:-|:-|:-|
|survival|Survival|0 = No, 1 = Yes|
|pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd|
|sex|Sex||
|Age|Age in years||
|sibsp|# of siblings / spouses aboard the Titanic||
|parch|# of parents / children aboard the Titanic||
|ticket|Ticket number||
|fare|Passenger fare||
|cabin|Cabin number||
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|


**Variable Notes**

**pclass**: A proxy for socio-economic status (SES)
<br>1st = Upper
<br>2nd = Middle
<br>3rd = Lower

**age**: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

**sibsp**: The dataset defines family relations in this way...
<br>Sibling = brother, sister, stepbrother, stepsister
<br>Spouse = husband, wife (mistresses and fiancés were ignored)

**parch**: The dataset defines family relations in this way...
<br>Parent = mother, father
<br>Child = daughter, son, stepdaughter, stepson
<br>Some children travelled only with a nanny, therefore parch=0 for them.
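The age note above can be turned into a simple flag. The following is a minimal sketch, not part of the original analysis: it assumes that any age of at least 1 with a fractional part of exactly 0.5 is an estimate. The first four ages come from the rows shown earlier and 0.42 is the minimum age in the dataset; 28.5 is a hypothetical value added only to exercise the rule.

```python
import pandas as pd

# First four ages are from the rows displayed above; 0.42 is the dataset's
# minimum age; 28.5 is a hypothetical value added to trigger the rule.
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 0.42, 28.5], name='Age')

# Fractional ages below 1 are infants, not estimates;
# an estimated age has the form xx.5.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(is_estimated.tolist())
```

Missing ages compare as False against `>= 1`, so they are never flagged as estimated.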


In [3]:
# Number of rows in the dataset
print(len(original_df))


891


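## 2 Defining the questions
### 2.1 Which factors are associated with a higher survival rate among those aboard?
* Ticket class (Pclass)
* Age
* Sex
* Having relatives aboard
* Port of embarkation

### 2.2 Which factors are related to social status / ticket class (Pclass)?
* Age
* Sex
* Having relatives aboard
* Port of embarkation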
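One way to approach questions like these is to compare the mean of `Survived` across groups, since the mean of a 0/1 column is the share of survivors. The sketch below is illustrative only: it uses the five rows displayed in section 1 rather than the full dataset, and `sample_df`, `HasRelatives`, and `survival_rate_by` are hypothetical names introduced here.

```python
import pandas as pd

# The five rows displayed by original_df.head() above (relevant columns only).
sample_df = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass':   [3, 1, 3, 1, 3],
    'Sex':      ['male', 'female', 'female', 'female', 'male'],
    'SibSp':    [1, 1, 0, 1, 0],
    'Parch':    [0, 0, 0, 0, 0],
})

# Hypothetical helper column: True if any sibling, spouse, parent,
# or child was aboard.
sample_df['HasRelatives'] = (sample_df['SibSp'] + sample_df['Parch']) > 0


def survival_rate_by(df, column):
    """Return the share of survivors within each group of `column`."""
    return df.groupby(column)['Survived'].mean()


rate_by_sex = survival_rate_by(sample_df, 'Sex')
rate_by_relatives = survival_rate_by(sample_df, 'HasRelatives')
print(rate_by_sex)
print(rate_by_relatives)
```

Five rows are far too few to draw conclusions from; the same calls applied to the full data would give the group-level rates the questions ask about.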

## 3 Data wrangling (handling missing data)
### 3.1 Count missing values per column

In [14]:
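print('%16s    %s' % ('Column', 'Missing values'))
print('=' * 30)
for col in original_df:
    # isna() flags missing cells; sum() counts them
    print('%16s    %d' % (col, original_df[col].isna().sum()))


          Column    Missing values
==============================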
     PassengerId    0
        Survived    0
          Pclass    0
            Name    0
             Sex    0
             Age    177
           SibSp    0
           Parch    0
          Ticket    0
            Fare    0
           Cabin    687
        Embarked    2


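A first look shows:
- Cabin has the most missing values; since none of the questions involve this column, it can be left as is.
- Age has many missing values while Embarked has few, so it is worth checking whether the records missing Embarked are also missing Age.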
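The per-column counts can also be obtained in a single call on the whole DataFrame. This is a minimal sketch on a toy frame; `toy_df` and its values are hypothetical and serve only to show the call pattern.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, used only to show the call pattern.
toy_df = pd.DataFrame({
    'Age':      [22.0, np.nan, 26.0, np.nan],
    'Cabin':    [None, 'C85', None, None],
    'Embarked': ['S', 'C', None, 'S'],
})

# isna() marks each missing cell; sum() counts the True values per column.
missing_counts = toy_df.isna().sum()
print(missing_counts)
```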

In [19]:
# Count the records where Embarked and Age are both missing
print((original_df['Embarked'].isna() & original_df['Age'].isna()).sum())

0


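### 3.2 Handling the missing data

Among the columns the questions involve, only Age and Embarked have missing values, and no record is missing both.

Because Embarked has very few missing values, the records with a missing Age or Embarked value can simply be dropped, and the resulting subset used for any question that involves age or port of embarkation.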

In [38]:
valid_age_and_harbour_df = original_df.loc[original_df['Age'].notna() & original_df['Embarked'].notna()]
# Verify the number of remaining records
print(len(valid_age_and_harbour_df))

712
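The count is consistent with the earlier figures: 891 - 177 - 2 = 712, since no record is missing both values. The same filter can also be written with `DataFrame.dropna` and its `subset` parameter. The sketch below compares the two forms on a toy frame; `toy_df` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, used only to show the call pattern.
toy_df = pd.DataFrame({
    'Age':      [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Cabin':    [None, 'C85', None, 'C123'],
})

# Boolean-mask filtering, as in the cell above.
by_mask = toy_df.loc[toy_df['Age'].notna() & toy_df['Embarked'].notna()]

# dropna(subset=...) drops rows missing any of the listed columns;
# missing values in other columns (Cabin here) are ignored.
by_dropna = toy_df.dropna(subset=['Age', 'Embarked'])

print(len(by_dropna))
print(by_mask.equals(by_dropna))
```

Both forms keep the original row index, so later lookups by index still line up with `original_df`.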


## 4 Exploring the data
### 4.1 Describing the data

In [52]:
# Summary statistics for each column of interest.
# Age is read from the subset without missing Age/Embarked values.
print(original_df['Survived'].describe(), '\n')
print(original_df['Pclass'].describe(), '\n')
print(valid_age_and_harbour_df['Age'].describe(), '\n')
print(original_df['Sex'].describe(), '\n')
print(original_df['SibSp'].describe(), '\n')
print(original_df['Parch'].describe(), '\n')
print(original_df['Fare'].describe(), '\n')

count    891.000000
mean       0.383838
std        0.486592
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64 

count    891.000000
mean       2.308642
std        0.836071
min        1.000000
25%        2.000000
50%        3.000000
75%        3.000000
max        3.000000
Name: Pclass, dtype: float64 

count    712.000000
mean      29.642093
std       14.492933
min        0.420000
25%       20.000000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64 

count      891
unique       2
top       male
freq       577
Name: Sex, dtype: object 

count    891.000000
mean       0.523008
std        1.102743
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        8.000000
Name: SibSp, dtype: float64 

count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        

In [None]:
### 4.2 