Dataset link: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt

The dataset contains several variables, including pclass, survived, name, age, embarked, home.dest, room, ticket, boat and sex. By analysing the data and building a suitable model, the analysis should identify which kinds of passengers were most likely to survive the tragedy.

In [47]:
import pandas as pd

In [50]:
df = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt')

In [51]:
df.head()

Unnamed: 0,row.names,pclass,survived,name,age,embarked,home.dest,room,ticket,boat,sex
0,1,1st,1,"Allen, Miss Elisabeth Walton",29.0,Southampton,"St Louis, MO",B-5,24160 L221,2,female
1,2,1st,0,"Allison, Miss Helen Loraine",2.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
2,3,1st,0,"Allison, Mr Hudson Joshua Creighton",30.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,(135),male
3,4,1st,0,"Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)",25.0,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
4,5,1st,1,"Allison, Master Hudson Trevor",0.9167,Southampton,"Montreal, PQ / Chesterville, ON",C22,,11,male


In [52]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 11 columns):
row.names    1313 non-null int64
pclass       1313 non-null object
survived     1313 non-null int64
name         1313 non-null object
age          633 non-null float64
embarked     821 non-null object
home.dest    754 non-null object
room         77 non-null object
ticket       69 non-null object
boat         347 non-null object
sex          1313 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 71.9+ KB


In [53]:
X = df[['pclass','age','sex']]
y = df['survived']

X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       633 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 20.6+ KB


In [54]:
# Assign the filled column back instead of modifying a slice of df in place
X = X.assign(age=X['age'].fillna(X['age'].mean()))
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
pclass    1313 non-null object
age       1313 non-null float64
sex       1313 non-null object
dtypes: float64(1), object(2)
memory usage: 20.6+ KB
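Mean imputation keeps the column mean unchanged but compresses its spread, which is one way the fill can influence a model. The sketch below illustrates this on made-up ages; the values and the names `age`, `was_missing` and `filled` are assumptions for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy ages, made up for illustration; NaN marks a missing value.
age = pd.Series([22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, np.nan])

# Remember which rows were missing before they are overwritten.
was_missing = age.isna()

# fillna returns a new Series, so nothing is modified in place.
filled = age.fillna(age.mean())

print("missing rows:", int(was_missing.sum()))
print("mean before/after:", age.mean(), filled.mean())
print("std before/after:", age.std(), filled.std())
```

The more values are filled, the more rows share exactly the same age, so a tree has less real variation in that column to split on.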


In [55]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size=0.25,random_state=33)


In [56]:
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)

In [57]:
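# Use orient='records'; newer pandas versions reject the abbreviated 'record'.
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
print(vec.feature_names_)
# Reuse the mapping fitted on the training data; do not refit on the test set.
X_test = vec.transform(X_test.to_dict(orient='records'))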


['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'sex=female', 'sex=male']
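DictVectorizer one-hot encodes string-valued fields and passes numeric fields through. A minimal sketch on made-up rows shows why the test set should go through `transform` with the vectorizer fitted on the training set: refitting on test rows that lack some categories yields a different column layout. The frames `train` and `test` and the name `vec_demo` are hypothetical.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

# Made-up rows for illustration only.
train = pd.DataFrame({
    "pclass": ["1st", "3rd", "2nd"],
    "age": [29.0, 2.0, 30.0],
    "sex": ["female", "male", "male"],
})
test = pd.DataFrame({
    "pclass": ["3rd", "3rd"],
    "age": [25.0, 40.0],
    "sex": ["female", "female"],
})

vec_demo = DictVectorizer(sparse=False)

# Learn the feature-to-column mapping from the training rows only.
train_encoded = vec_demo.fit_transform(train.to_dict(orient="records"))

# Apply the same mapping to the test rows.
test_encoded = vec_demo.transform(test.to_dict(orient="records"))

# Refitting on the test rows alone produces fewer, differently ordered columns.
refit = DictVectorizer(sparse=False).fit_transform(test.to_dict(orient="records"))

print(vec_demo.feature_names_)
print(train_encoded.shape, test_encoded.shape, refit.shape)
```

With the shared mapping, both matrices have the same six columns, so a classifier trained on one can score the other.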


In [58]:
from sklearn.tree import DecisionTreeClassifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train,Y_train)
y_predict = dtc.predict(X_test)

In [59]:
from sklearn.metrics import classification_report

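print(dtc.score(X_test,Y_test))
# classification_report expects (y_true, y_pred); the report below was printed with them swapped,
# so its precision and recall columns are interchanged and support counts predictions.
print(classification_report(Y_test,y_predict,target_names=['died','survived']))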


0.7811550151975684
              precision    recall  f1-score   support

        died       0.91      0.78      0.84       236
    survived       0.58      0.80      0.67        93

    accuracy                           0.78       329
   macro avg       0.74      0.79      0.75       329
weighted avg       0.81      0.78      0.79       329
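scikit-learn metric functions such as `classification_report` take the true labels first and the predictions second. A small sketch on made-up labels (not the notebook's data) suggests why the order matters: swapping the two arguments exchanges precision and recall.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 0 = died, 1 = survived.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

# Correct order: (y_true, y_pred).
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# Swapped order: the predictions are treated as if they were the truth.
p_swapped = precision_score(y_pred, y_true)
r_swapped = recall_score(y_pred, y_true)

print("correct order:", p, r)
print("swapped order:", p_swapped, r_swapped)
```

Accuracy is symmetric in its two arguments, so `dtc.score` is unaffected by this kind of mix-up; only the per-class columns change.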



# Conclusion

1. The accuracy of our prediction model on the test set is 0.78. The weighted-average precision is 0.81 and the weighted-average recall is 0.78. We consider this result satisfactory.

2. In this case, we took only "pclass", "age" and "sex" into consideration and ignored all other factors.

# Additional Notes

1. Age is missing for 680 of the 1,313 passengers (about 52%). We simply filled these gaps with the mean age, which may have a non-negligible impact on the final result.

2. In this case, we selected only three factors based on our subjective judgment and, owing to time constraints, did not analyse how each individual factor affects survival.
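One possible starting point for analysing individual factors is the `feature_importances_` attribute of a fitted decision tree. The sketch below uses a made-up six-row table in which survival follows sex exactly, so the outcome is known in advance; the data and the names `toy`, `toy_vec`, `toy_tree` and `by_column` are hypothetical, and real data would give a less clear-cut picture.

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Made-up passengers; survival is constructed to depend on sex only.
toy = pd.DataFrame({
    "pclass": ["1st", "3rd", "2nd", "1st", "3rd", "2nd"],
    "age": [29.0, 40.0, 25.0, 30.0, 2.0, 50.0],
    "sex": ["female", "female", "male", "male", "female", "male"],
})
toy_y = [1, 1, 0, 0, 1, 0]

# One-hot encode the string columns, as in the notebook.
toy_vec = DictVectorizer(sparse=False)
toy_X = toy_vec.fit_transform(toy.to_dict(orient="records"))

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_y)

# One importance value per encoded column, e.g. 'sex=female'.
importances = pd.Series(toy_tree.feature_importances_, index=toy_vec.feature_names_)

# Sum the one-hot columns back into their source column ('sex', 'pclass', 'age').
by_column = importances.groupby(lambda name: name.split("=")[0]).sum()
print(by_column.sort_values(ascending=False))
```

Impurity-based importances describe how the tree used each column, not a causal effect, so they are better read as a hint about where to look next.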