---
title: "Titanic"
author: "Beto"
date: "May 13, 2017"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Titanic: Machine Learning from Disaster
#### Competition Description
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.
This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
---
#### Practice Skills
- Binary classification
- Python and R basics
#### Overview
The data has been split into two groups:
- training set (train.csv)
- test set (test.csv)
The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.
The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.
---
#### Data Dictionary
|Variable | Definition | Key |
|----------|-------------------|-----------------------------|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
#### Variable Notes
pclass: A proxy for socio-economic status (SES)
- 1st = Upper
- 2nd = Middle
- 3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way...
- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson
- Some children travelled only with a nanny, therefore parch=0 for them.
#### Importing the data downloaded from the Kaggle competition
---
```{r message=FALSE, warning=FALSE}
library(tidyverse)
# Import the train dataset, declaring the categorical columns as factors
train <- read_csv("~/Data Science/kaggle/titanic/train.csv",
                  col_types = cols(
                    Embarked = col_factor(levels = c("C", "Q", "S")),
                    Pclass   = col_factor(levels = c("1", "2", "3")),
                    Sex      = col_factor(levels = c("male", "female")),
                    Survived = col_factor(levels = c("0", "1"))))
train <- as.data.frame(train)
# Import the test dataset (it has no Survived column: that is what we predict)
test <- read_csv("~/Data Science/kaggle/titanic/test.csv",
                 col_types = cols(
                   Embarked = col_factor(levels = c("C", "Q", "S")),
                   Pclass   = col_factor(levels = c("1", "2", "3")),
                   Sex      = col_factor(levels = c("male", "female"))))
test <- as.data.frame(test)
```
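As a quick sanity check of the age conventions described in the variable notes, we can look for fractional infant ages and estimated xx.5 ages (a small added check, not part of the original notebook):
```{r}
# Infants: fractional ages below 1
subset(train, Age < 1, select = c(PassengerId, Name, Age))
# Estimated ages: recorded in the form xx.5
sum(train$Age %% 1 == 0.5, na.rm = TRUE)
```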
---
First we need to get to know the data in both datasets, starting with the training set.
```{r}
head(train) # viewing the first 6 lines
str(train) # structure of data
summary(train) # main statistics
```
---
### Train dataset
It has 891 rows (observations) and 12 columns (variables):
|Variable | Type | Levels | NAs |
|--------------|-----------|----------|------|
| PassengerId | Integer | | None |
| Survived | Factor | 2 | None |
| Pclass | Factor | 3 | None |
| Name | Character | | None |
| Sex | Factor | 2 | None |
| Age | Numeric | | 177 |
| SibSp | Integer | | None |
| Parch | Integer | | None |
| Ticket | Character | | None |
| Fare | Numeric | | None |
| Cabin | Character | | 687 |
| Embarked | Factor | 3 | 2 |
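These NA counts can be verified directly. Note that `summary()` does not report missing values for character columns such as Cabin, so a quick check like this (an added line) is useful:
```{r}
# Missing values per column
colSums(is.na(train))
```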
---
The objective of this challenge is to predict whether or not a passenger of the Titanic survived, given several variables about each one.
So, how many passengers survived? First construct a table and find the proportions.
---
```{r}
# table
table(train$Survived)
# proportions
prop.table(table(train$Survived))
```
---
We can see that in our dataset **549 passengers (61.6%) didn't survive** and **342 (38.4%) did survive**.
It would also be helpful to see a two-way table comparing Survived with another variable, to check whether it may have predictive value. For example:
```{r}
# absolute numbers
table(train$Pclass, train$Survived)
# proportions
prop.table(table(train$Pclass, train$Survived), margin = 1)
```
---
It's clear that the proportion of survivors decreases from 1st class to 3rd class.
---
```{r}
table(train$Sex, train$Survived)
prop.table(table(train$Sex, train$Survived), margin = 1)
```
---
Similarly with gender: 468 males (81%) died and just 109 (19%) survived.
What about the port of embarkation?
---
```{r}
table(train$Embarked, train$Survived)
round(prop.table(table(train$Embarked, train$Survived), margin = 1),2)
```
---
Note that the proportion of survivors (Survived = 1) decreases from Cherbourg (France) to Queenstown (now Cobh, Ireland) to Southampton (England).
Is there a relation between port of embarkation and class?
---
```{r}
table(train$Embarked, train$Pclass)
round(prop.table(table(train$Embarked, train$Pclass), margin = 1),2)
```
---
From the table and proportions we can see the following facts:
- 61% of the passengers who embarked at Cherbourg (France) travelled in 1st or 2nd class, which could explain why these passengers survived the most.
- On the contrary, 94% of the passengers who embarked at Queenstown (now Cobh, Ireland) travelled in 3rd class, suggesting they were mostly poor emigrants.
- Most passengers embarked at Southampton (England), 45% of them in 1st or 2nd class and 55% in 3rd.
---
#### Visualization
```{r}
library(ggplot2) # already attached by tidyverse; kept for clarity
# Bar chart of ticket class, filled by survival status
qplot(Pclass, data = train, geom = "bar", fill = Survived) +
  theme(legend.position = "top")
```
```{r}
# Bar chart of sex, filled by survival status
qplot(Sex, data = train, geom = "bar", fill = Survived) +
  theme(legend.position = "top")
```
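Following the same pattern, a bar chart for the port of embarkation examined above (an added sketch, mirroring the two plots just shown):
```{r}
# Bar chart of embarkation port, filled by survival status
qplot(Embarked, data = train, geom = "bar", fill = Survived) +
  theme(legend.position = "top")
```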
---
An interesting investigation could be made on children. We can expect that a child had a better chance to survive. Considering a child to be a person under 16 years old (?), let's look at the numbers...
---
```{r}
# create a vector with the rows where the passenger is under 16 years old
child <- which(train$Age < 16)
# subset the train dataset
child_status <- train[child, ]
# how many survived?
summary(child_status$Survived)
# table of survival status by Pclass
table(child_status$Pclass, child_status$Survived)
# proportions
prop.table(table(child_status$Pclass, child_status$Survived), margin = 1)
```
---
From these numbers, we can see the following interesting facts:
- About 40% of the children died and 60% survived, roughly the reverse of the proportions for the whole training set;
- Looking by Pclass, 83% of 1st-class children survived, 100% (!!) of 2nd-class children survived, and just 43% of 3rd-class children survived.
So being a child in 1st or 2nd class was practically a passport to survival, and that characteristic has to be considered when the final training set is analysed.
---
Some manipulations:
- create a new column `childhood`, with the value 1 for passengers under 16 years old
and 0 for older ones.
```{r}
# 1 = under 16 years old, 0 = 16 or older (NA when Age is missing)
train <- mutate(train, childhood = ifelse(Age < 16, 1, 0))
```
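A quick check of the new column (an added line, not in the original): the counts should be consistent with the child analysis above, with NAs for the passengers whose age is missing.
```{r}
# Distribution of the new childhood flag, including missing ages
table(train$childhood, useNA = "ifany")
```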
---
Load the randomForest library, set a seed, and fit the random forest classifier.
```{r}
library(randomForest)
# Set seed and create a random forest classifier
set.seed(1234)
model_forest <- randomForest(Survived ~ Pclass + Sex + Fare +
                               childhood + Embarked,
                             data = train,
                             importance = TRUE,
                             ntree = 100,
                             nodesize = 1,
                             na.action = na.exclude)
```
```{r}
# predict the values on the train dataset
pred_forest_train <- predict(model_forest, train)
# Observe the first few rows of your predictions
head(pred_forest_train)
```
---
#### Evaluating the Random Forest
Access your first random forest as `model_forest` and its training-set predictions as `pred_forest_train`.
Use the caret library to view the confusion matrix for the model.
This will tell you the model's accuracy on the training set, as well as other performance measures like sensitivity and specificity.
You can see this by running the following code:
```{r}
library(caret)
confusionMatrix(pred_forest_train, train$Survived)
```
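If you want just the measures mentioned above, they can be pulled out of the object that `confusionMatrix()` returns (a small addition to the original code):
```{r}
# Store the confusion matrix and extract the headline measures
cm <- confusionMatrix(pred_forest_train, train$Survived)
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity")]
```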
---
```{r}
# Variable Importance
# Now it is time to take a look at how important the inputs were
# to your predictive model. Here, you can use:
importance(model_forest)
varImpPlot(model_forest)
```
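The notebook stops at the training-set evaluation. A sketch of the remaining competition step, predicting on the test set and writing a submission file, might look like the following (not evaluated here; the simple imputations for the missing Age and Fare values in the test set are assumptions, not part of the original analysis):
```{r eval=FALSE}
# Apply the same feature engineering to the test set
test <- mutate(test, childhood = ifelse(Age < 16, 1, 0))
# randomForest cannot predict with missing predictors, so impute simply
test$childhood[is.na(test$childhood)] <- 0
test$Fare[is.na(test$Fare)] <- median(test$Fare, na.rm = TRUE)
# Predict and write the Kaggle submission file
pred_test <- predict(model_forest, test)
submission <- data.frame(PassengerId = test$PassengerId, Survived = pred_test)
write_csv(submission, "titanic_submission.csv")
```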
```{r}
# timestamp of when this notebook was run
Sys.time()
```
---