# Assignment 1: Titanic Survival Data Analysis


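## Dataset Overview
The Titanic survival data comes from the classic introductory competition on Kaggle. It records whether passengers with different attributes survived. Each attribute is described briefly below:
* PassengerId: unique ID of the passenger
* Survived: whether the passenger survived; 1 means survived, 0 means died
* Pclass: cabin class of the passenger
* Name: passenger name
* Sex: passenger sex
* Age: passenger age
* SibSp: number of siblings and spouses of the passenger aboard
* Parch: number of parents and children of the passenger aboard
* Ticket: ticket number
* Fare: ticket fare
* Cabin: cabin number
* Embarked: port of embarkation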

In [15]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

## Loading the Dataset
Load the training set and take a first look at it with the `describe()` and `isnull()` methods.

In [16]:
train_data = pd.read_csv("kaggle/input/titanic/train.csv")

In [17]:
train_data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [18]:
train_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
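
To put the counts above in perspective, they can be expressed as a share of the 891 rows (the row count that `describe()` reports further down in this notebook). A minimal sketch that uses only the three non-zero counts from the output; the variable names are illustrative:

```python
import pandas as pd

# Non-zero missing-value counts copied from the isnull().sum() output above
missing = pd.Series({'Age': 177, 'Cabin': 687, 'Embarked': 2})
n_rows = 891  # total number of rows in the training set

# Percentage of rows with a missing value, per column
missing_pct = (missing / n_rows * 100).round(1)
print(missing_pct)
```

Cabin is missing for roughly three quarters of the passengers, Age for about a fifth, and Embarked for only two rows, which is why the three columns are treated differently in the preprocessing.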

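## Data Preprocessing

Age matters for the analysis and affects its results, so I plan to fill in the missing Age values with a random forest. Before that, the columns that are not used as features (Ticket, Name, and PassengerId) are dropped and the other missing values are handled.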

In [19]:
# Drop the columns that are not used as features
train_data = train_data.drop(['Ticket', 'Name', 'PassengerId'], axis=1)

Most passengers have Embarked equal to 'S', so it is reasonable to guess that the two missing values are most likely 'S' as well.

In [20]:
# Assign the result back instead of using inplace=True on a selected column
train_data['Embarked'] = train_data['Embarked'].fillna('S')
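
The most frequent value can also be looked up rather than hard-coded. A minimal sketch on a toy column (the values below are made up for illustration and are not taken from train.csv):

```python
import pandas as pd

# Toy stand-in for the Embarked column; None marks a missing value
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S', None])

# value_counts() ignores missing values by default
print(embarked.value_counts())

# mode() returns a Series because ties are possible; take the first entry
most_common = embarked.mode()[0]
embarked_filled = embarked.fillna(most_common)
```

Running `value_counts()` on the real column is a quick way to confirm the guess before filling.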

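Cabin is missing for most passengers (687 of the 891 rows), so the missing entries are filled with the placeholder string 'U', which keeps the whole column as strings.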

In [21]:
# Assign the result back instead of using inplace=True on a selected column
train_data['Cabin'] = train_data['Cabin'].fillna('U')

In [22]:
train_data.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin         0
Embarked      0
dtype: int64

In [23]:
train_data.head()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,U,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.925,U,S
3,1,1,female,35.0,1,0,53.1,C123,S
4,0,3,male,35.0,0,0,8.05,U,S


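With the other attributes taken care of, the string attributes must first be converted to numeric ones; only then can the missing Age values be imputed.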

In [24]:
# Convert the Embarked column to numeric
train_data['Embarked'] = train_data['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
# Convert the Sex column to numeric
train_data['Sex'] = train_data['Sex'].map({'female': 0, 'male': 1})
# Take the first letter of each Cabin value as the deck, then encode it as a number
train_data['Deck'] = train_data['Cabin'].map(lambda x: x[0])
train_data['Deck'] = train_data['Deck'].map({'U': 0, 'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5, 'F': 6, 'G': 7, 'T': 8})
train_data.drop(['Cabin'], axis=1, inplace=True)
# For the test set: fill the missing fare with the median fare of the passenger's class
# test_data['Fare'].fillna(test_data.groupby('Pclass')['Fare'].transform('median'), inplace=True)
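
One caveat with `Series.map` and a dictionary: any value that is not a key of the dictionary becomes NaN without raising an error. A small sketch on toy data (the value 'X' is a hypothetical unexpected port code) shows the effect and a simple check:

```python
import pandas as pd

ports = pd.Series(['S', 'C', 'Q', 'X'])  # 'X' is not in the mapping
codes = ports.map({'S': 0, 'C': 1, 'Q': 2})

# The unmapped value turned into NaN, which also makes the column float
print(codes)

# Count the values that were lost by the mapping
n_unmapped = codes.isnull().sum()
print(n_unmapped)
```

In this notebook the `describe()` output reports a count of 891 for both Embarked and Deck, so no value was lost by the mappings.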

In [25]:
train_data.head()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Deck
0,0,3,1,22.0,1,0,7.25,0,0
1,1,1,0,38.0,1,0,71.2833,1,3
2,1,3,0,26.0,0,0,7.925,0,0
3,1,1,0,35.0,1,0,53.1,0,3
4,0,3,1,35.0,0,0,8.05,0,0


In [26]:
train_data.describe()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Deck
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.647587,29.699118,0.523008,0.381594,32.204208,0.361392,0.776655
std,0.486592,0.836071,0.47799,14.526497,1.102743,0.806057,49.693429,0.635673,1.590899
min,0.0,1.0,0.0,0.42,0.0,0.0,0.0,0.0,0.0
25%,0.0,2.0,0.0,20.125,0.0,0.0,7.9104,0.0,0.0
50%,0.0,3.0,1.0,28.0,0.0,0.0,14.4542,0.0,0.0
75%,1.0,3.0,1.0,38.0,1.0,0.0,31.0,1.0,0.0
max,1.0,3.0,1.0,80.0,8.0,6.0,512.3292,2.0,8.0


In [27]:
from sklearn.ensemble import RandomForestRegressor

# Split the rows by whether Age is known; .copy() avoids SettingWithCopyWarning
titanicWithAge = train_data[train_data['Age'].notnull()].copy()
titanicWithoutAge = train_data[train_data['Age'].isnull()].copy()

# Features used to predict Age
variables = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Deck']

# Train on the rows with a known Age
rfModel_age = RandomForestRegressor()
rfModel_age.fit(titanicWithAge[variables], titanicWithAge['Age'])

# Predict the missing ages and truncate them to whole years
generatedAgeValues = rfModel_age.predict(X=titanicWithoutAge[variables])
titanicWithoutAge['Age'] = generatedAgeValues.astype(int)

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
# The imputed rows end up at the bottom, and ignore_index renumbers all rows.
train_data = pd.concat([titanicWithAge, titanicWithoutAge], ignore_index=True)



In [28]:
train_data.isnull().sum()

Survived    0
Pclass      0
Sex         0
Age         0
SibSp       0
Parch       0
Fare        0
Embarked    0
Deck        0
dtype: int64
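
The same kind of imputation can be written without splitting and re-joining the table, by selecting rows with a boolean mask and assigning through `.loc`, which keeps the original row order. A minimal sketch on synthetic data; the column values, the `predictors` list, and the `random_state` settings are illustrative assumptions, not part of the analysis above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the preprocessed table (not the Titanic data)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=200),
    'Fare': rng.uniform(5, 100, size=200),
})
df['Age'] = 50 - 8 * df['Pclass'] + rng.normal(0, 3, size=200)
df.loc[df.sample(40, random_state=0).index, 'Age'] = np.nan  # knock out 40 ages

predictors = ['Pclass', 'Fare']
missing = df['Age'].isnull()

# Fit on the rows with a known Age
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(df.loc[~missing, predictors], df.loc[~missing, 'Age'])

# Write the predictions back in place; the row order is unchanged
df.loc[missing, 'Age'] = model.predict(df.loc[missing, predictors])
```

Fixing `random_state` makes the imputed values reproducible between runs, which the unseeded regressor does not guarantee.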

In [None]:
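from sklearn.ensemble import RandomForestClassifier

y = train_data["Survived"]

# Sex and Embarked were already encoded as numbers above, so they are used
# directly and pd.get_dummies is no longer needed.
features = ['Pclass', 'SibSp', 'Parch', 'Fare', 'Sex', 'Embarked']
X = train_data[features]

# Assumption: test.csv is stored next to train.csv. It needs the same encoding.
test_data = pd.read_csv("kaggle/input/titanic/test.csv")
X_test = test_data[features].copy()
X_test['Sex'] = X_test['Sex'].map({'female': 0, 'male': 1})
X_test['Embarked'] = X_test['Embarked'].fillna('S').map({'S': 0, 'C': 1, 'Q': 2})
# Fill a missing fare with the median fare of the passenger's class
X_test['Fare'] = X_test['Fare'].fillna(X_test.groupby('Pclass')['Fare'].transform('median'))

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")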
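
The Kaggle test set does not include the Survived column, so one way to estimate accuracy before submitting is k-fold cross-validation on the training set. The sketch below uses synthetic data from `make_classification` as a stand-in for `X` and `y`; the names `X_demo`, `y_demo`, and `clf` and the choice of 5 folds are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training features and labels
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

# Accuracy on each of 5 held-out folds
scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')
print(scores.mean())
```

With the real data, passing `X` and `y` from the cell above in place of the synthetic arrays would give a rough estimate of how the submitted model generalizes.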