---
title: "Titanic"
author: "Beto"
date: "May 13, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Titanic: Machine Learning from Disaster
#### Competition Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
---
#### Practice Skills
- Binary classification
- Python and R basics
#### Overview
The data has been split into two groups:
- training set (train.csv)
- test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
---
#### Data Dictionary
|Variable | Definition | Key |
|----------|-------------------|-----------------------------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
#### Variable Notes
pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.
#### Importing the data downloaded from the Kaggle competition
---
```{r message=FALSE, warning=FALSE}
library(tidyverse)
# Import the train dataset, declaring the categorical columns as factors
train <- read_csv("~/Data Science/kaggle/titanic/train.csv",
                  col_types = cols(
                    Embarked = col_factor(levels = c("C", "Q", "S")),
                    Pclass   = col_factor(levels = c("1", "2", "3")),
                    Sex      = col_factor(levels = c("male", "female")),
                    Survived = col_factor(levels = c("0", "1"))))
train <- as.data.frame(train)
# Import the test dataset (it has no Survived column: that is what we predict)
test <- read_csv("~/Data Science/kaggle/titanic/test.csv",
                 col_types = cols(
                   Embarked = col_factor(levels = c("C", "Q", "S")),
                   Pclass   = col_factor(levels = c("1", "2", "3")),
                   Sex      = col_factor(levels = c("male", "female"))))
test <- as.data.frame(test)
```
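As a quick sanity check of the age conventions described in the variable notes, we can look for fractional infant ages and estimated xx.5 ages (a small added check, not part of the original notebook):
```{r}
# Infants: fractional ages below 1
subset(train, Age < 1, select = c(PassengerId, Name, Age))
# Estimated ages: recorded in the form xx.5
sum(train$Age %% 1 == 0.5, na.rm = TRUE)
```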
---
First we need to get to know the data in both datasets, starting with the training set.
```{r}
head(train) # viewing the first 6 lines
str(train) # structure of data
summary(train) # main statistics
```
---
### Train dataset
It has 891 rows (observations) and 12 columns (variables):
|Variable | Type | Levels | NAs |
|--------------|-----------|----------|------|
| PassengerId | Integer | | None |
| Survived | Factor | 2 | None |
| Pclass | Factor | 3 | None |
| Name | Character | | None |
| Sex | Factor | 2 | None |
| Age | Numeric | | 177 |
| SibSp | Integer | | None |
| Parch | Integer | | None |
| Ticket | Character | | None |
| Fare | Numeric | | None |
| Cabin | Character | | 687 |
| Embarked | Factor | 3 | 2 |
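These NA counts can be verified directly. Note that `summary()` does not report missing values for character columns such as Cabin, so a quick check like this (an added line) is useful:
```{r}
# Missing values per column
colSums(is.na(train))
```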
---
The objective of this challenge is to predict whether or not a passenger of the Titanic survived, given several variables about each one.
So, how many passengers survived? First construct a table and find the proportions.
---
```{r}
# table
table(train$Survived)
# proportions
prop.table(table(train$Survived))
```
---
We can see that in our dataset **549 passengers (61.6%) didn't survive** and **342 (38.4%) did survive**.
It would also be helpful to see a two-way table comparing Survived with another variable, to check whether it may have predictive value. For example:
```{r}
# absolute numbers
table(train$Pclass, train$Survived)
# proportions
prop.table(table(train$Pclass, train$Survived), margin = 1)
```
---
It's clear that the proportion of survivors decreases from 1st class to 3rd class.
---
```{r}
table(train$Sex, train$Survived)
prop.table(table(train$Sex, train$Survived), margin = 1)
```
---
Similarly with gender: 468 males (81%) died and just 109 (19%) survived.
What about the port of embarkation?
---
```{r}
table(train$Embarked, train$Survived)
round(prop.table(table(train$Embarked, train$Survived), margin = 1),2)
```
---
Note that the proportion of survivors (Survived = 1) decreases from Cherbourg (France) to Queenstown (now Cobh, Ireland) to Southampton (England).
Is there a relation between port of embarkation and class?
---
```{r}
table(train$Embarked, train$Pclass)
round(prop.table(table(train$Embarked, train$Pclass), margin = 1),2)
```
---
From the table and proportions we can see the following facts:
- 61% of the passengers who embarked at Cherbourg (France) travelled in 1st or 2nd class, which could explain why these passengers survived the most.
- On the contrary, 94% of the passengers who embarked at Queenstown (now Cobh, Ireland) travelled in 3rd class, suggesting they were mostly poor emigrants.
- Most passengers embarked at Southampton (England), 45% of them in 1st or 2nd class and 55% in 3rd.
---
#### Visualization
```{r}
library(ggplot2) # already attached by tidyverse; kept for clarity
# Bar chart of ticket class, filled by survival status
qplot(Pclass, data = train, geom = "bar", fill = Survived) +
  theme(legend.position = "top")
```
```{r}
# Bar chart of sex, filled by survival status
qplot(Sex, data = train, geom = "bar", fill = Survived) +
  theme(legend.position = "top")
```
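Following the same pattern, a bar chart for the port of embarkation examined above (an added sketch, mirroring the two plots just shown):
```{r}
# Bar chart of embarkation port, filled by survival status
qplot(Embarked, data = train, geom = "bar", fill = Survived) +
  theme(legend.position = "top")
```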
---
An interesting investigation could be made on children. We can expect that a child had a better chance to survive. Considering a child to be a person under 16 years old (?), let's look at the numbers...
---
```{r}
# create a vector with the rows where the passenger is under 16 years old
child <- which(train$Age < 16)
# subset the train dataset
child_status <- train[child, ]
# how many survived?
summary(child_status$Survived)
# table of survival status by Pclass
table(child_status$Pclass, child_status$Survived)
# proportions
prop.table(table(child_status$Pclass, child_status$Survived), margin = 1)
```
---
From these numbers, we can see the following interesting facts:
- About 40% of the children died and 60% survived, roughly the reverse of the proportions for the whole training set;
- Looking by Pclass, 83% of 1st-class children survived, 100% (!!) of 2nd-class children survived, and just 43% of 3rd-class children survived.
So being a child in 1st or 2nd class was practically a passport to survival, and that characteristic has to be considered when the final training set is analysed.
---
Some manipulations:
- create a new column `childhood`, with the value 1 for passengers under 16 years old
and 0 for older ones.
```{r}
# 1 = under 16 years old, 0 = 16 or older (NA when Age is missing)
train <- mutate(train, childhood = ifelse(Age < 16, 1, 0))
```
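A quick check of the new column (an added line, not in the original): the counts should be consistent with the child analysis above, with NAs for the passengers whose age is missing.
```{r}
# Distribution of the new childhood flag, including missing ages
table(train$childhood, useNA = "ifany")
```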
---
Load the randomForest library, set a seed, and fit the random forest classifier.
```{r}
library(randomForest)
# Set seed and create a random forest classifier
set.seed(1234)
model_forest <- randomForest(Survived ~ Pclass + Sex + Fare +
                               childhood + Embarked,
                             data = train,
                             importance = TRUE,
                             ntree = 100,
                             nodesize = 1,
                             na.action = na.exclude)
```
```{r}
# predict the values on the train dataset
pred_forest_train <- predict(model_forest, train)
# Observe the first few rows of your predictions
head(pred_forest_train)
```
---
#### Evaluating the Random Forest
Access your first random forest as `model_forest` and its training-set predictions as `pred_forest_train`.
Use the caret library to view the confusion matrix for the model.
This will tell you the model's accuracy on the training set, as well as other performance measures like sensitivity and specificity.
You can see this by running the following code:
```{r}
library(caret)
confusionMatrix(pred_forest_train, train$Survived)
```
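If you want just the measures mentioned above, they can be pulled out of the object that `confusionMatrix()` returns (a small addition to the original code):
```{r}
# Store the confusion matrix and extract the headline measures
cm <- confusionMatrix(pred_forest_train, train$Survived)
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity")]
```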
---
```{r}
# Variable Importance
# Now it is time to take a look at how important the inputs were
# to your predictive model. Here, you can use:
importance(model_forest)
varImpPlot(model_forest)
```
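The notebook stops at the training-set evaluation. A sketch of the remaining competition step, predicting on the test set and writing a submission file, might look like the following (not evaluated here; the simple imputations for the missing Age and Fare values in the test set are assumptions, not part of the original analysis):
```{r eval=FALSE}
# Apply the same feature engineering to the test set
test <- mutate(test, childhood = ifelse(Age < 16, 1, 0))
# randomForest cannot predict with missing predictors, so impute simply
test$childhood[is.na(test$childhood)] <- 0
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm = TRUE)
# Predict and write the Kaggle submission file
pred_test <- predict(model_forest, test)
submission <- data.frame(PassengerId = test$PassengerId, Survived = pred_test)
write_csv(submission, "titanic_submission.csv")
```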
```{r}
# timestamp of when this notebook was run
Sys.time()
```
---