# Predicting Titanic survival with logistic regression
### Logistic regression is used to predict a binary outcome (0/1).
### Here we use it to predict the "Survived" column of the Titanic dataset, which is available in R through the "titanic" package.
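
As a minimal sketch of the idea, the logistic function turns a linear score (the log-odds) into a probability between 0 and 1, which is then thresholded at 0.5. The scores in `z` are hypothetical values chosen only for illustration; they are not taken from the Titanic data.

```r
# Logistic (sigmoid) function: maps any real number into the interval (0, 1).
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical log-odds values, for illustration only.
z <- c(-2, 0, 2)
p <- logistic(z)
print(p)

# Base R's plogis() computes the same function.
print(all.equal(p, plogis(z)))

# Predict class 1 when the probability is at least 0.5 (equivalently, z >= 0).
pred <- ifelse(p >= 0.5, 1, 0)
print(pred)
```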

In [1]:
install.packages("titanic")


## Load the titanic library

In [2]:
library(titanic)

## Preview the dataset
### Previewing the dataset shows that some columns, such as "Age", contain missing values.

In [3]:
View(titanic_train)

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
<int>,<int>,<int>,<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<dbl>,<chr>,<chr>
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.2500,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.9250,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1000,C123,S
5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.0500,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.0750,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C
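
Rather than spotting missing values by eye, they can be counted per column. The sketch below uses a small hypothetical data frame (`toy`, with invented values) so that it runs without the "titanic" package; the same call should work on `titanic_train`.

```r
# Toy data frame with invented values that mimics a column containing NA.
toy <- data.frame(
  Age  = c(22, 38, NA, 35, NA),
  Fare = c(7.25, 71.28, 8.46, 53.10, 8.05)
)

# is.na() marks each missing cell; colSums() counts them per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows that contain at least one missing value.
rows_with_na <- which(!complete.cases(toy))
print(rows_with_na)
```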


## Number of rows in the dataset

In [4]:
nrow(titanic_train)

## Drop missing values from the dataset
### We drop rows with missing values before fitting the model.

In [5]:
titanic_train <- na.omit(titanic_train)
titanic_train

Row,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
,<int>,<int>,<int>,<chr>,<chr>,<dbl>,<int>,<int>,<chr>,<dbl>,<chr>,<chr>
1,1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.2500,,S
2,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
3,3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.9250,,S
4,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1000,C123,S
5,5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.0500,,S
7,7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
8,8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.0750,,S
9,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27,0,2,347742,11.1333,,S
10,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14,1,0,237736,30.0708,,C
11,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4,1,1,PP 9549,16.7000,G6,S
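
In the output above, passenger 6 (whose "Age" is missing) is gone and the row labels jump from 5 to 7, while rows with an empty "Cabin" are still present. The sketch below reproduces both effects on a hypothetical three-row data frame: `na.omit()` removes rows containing `NA`, keeps the original row names, and does not treat empty strings as missing.

```r
# Toy data with invented values: one NA in Age, empty strings in Cabin.
toy <- data.frame(
  Age   = c(22, NA, 54),
  Cabin = c("", "", "E46"),
  stringsAsFactors = FALSE
)

clean <- na.omit(toy)

# Only the row with NA is removed; the original row names are kept.
print(nrow(clean))
print(rownames(clean))

# Empty strings are not NA, so this row survives na.omit().
print(sum(clean$Cabin == ""))
```

If empty strings should also count as missing, one option is to convert them to `NA` before calling `na.omit()`.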


## Split the data into 70% for training and 30% for testing

In [6]:
set.seed(42)
n <- nrow(titanic_train)
id <- sample(seq_len(n), floor(n * 0.7)) ## 70% train, 30% test; floor() makes the sample size an explicit integer
train_data <- titanic_train[id, ]
test_data <- titanic_train[-id, ]
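
A quick sanity check on a split like this is that the two parts are disjoint and together cover every row. The sketch below shows the check on a hypothetical ten-row data frame; the names `toy`, `idx`, `train`, and `test` are illustrative only.

```r
set.seed(42)

# Toy data frame with one hypothetical identifier column.
n <- 10
toy <- data.frame(row_id = seq_len(n))

# Sample 70% of the row indices for training; the rest go to testing.
idx <- sample(seq_len(n), floor(n * 0.7))
train <- toy[idx, , drop = FALSE]
test <- toy[-idx, , drop = FALSE]

# No row appears in both parts, and no row is lost.
overlap <- intersect(train$row_id, test$row_id)
print(length(overlap))
print(nrow(train) + nrow(test) == n)
```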

## Train the model
#### We use the "Sex" variable as the only predictor.
#### The model can be written as Survived = f(Sex).

In [7]:
titanic_model <- glm(Survived ~ Sex, data = train_data, family = "binomial")
prob_train <- predict(titanic_model, type = "response") ## probability of Survived
train_data$Prob_Survived <- prob_train
train_data$Pred_Survived <- ifelse(train_data$Prob_Survived >= 0.5, 1, 0)

In [8]:
confus_matrix_train <- table(train_data$Pred_Survived, train_data$Survived, dnn = c("Predicted", "Actual"))
confus_matrix_train
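
Because "Sex" is the only predictor, the model can produce only two distinct probabilities, one per sex, and each equals the observed survival rate of that group in the training data. The sketch below illustrates this on a hypothetical ten-row data frame with invented values, and also shows how the coefficient can be read as an odds ratio.

```r
# Toy data with invented values: 4 of 5 females and 1 of 5 males survive.
toy <- data.frame(
  Survived = c(1, 1, 1, 0, 0, 0, 0, 1, 0, 1),
  Sex      = c("female", "female", "female", "female", "male",
               "male", "male", "male", "male", "female")
)

fit <- glm(Survived ~ Sex, data = toy, family = "binomial")

# Observed survival rate per group.
rates <- tapply(toy$Survived, toy$Sex, mean)
print(rates)

# Fitted probability per group; these match the observed rates.
fitted_by_sex <- tapply(fitted(fit), toy$Sex, mean)
print(fitted_by_sex)

# exp(coefficient) is the odds of survival for males relative to females.
odds_ratio <- exp(coef(fit))[["Sexmale"]]
print(odds_ratio)
```

With a 0.5 threshold, such a model predicts survival for every passenger in a group whose survival rate is at least 0.5 and non-survival for everyone else.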

## Test the model

In [9]:
prob_test <- predict(titanic_model, newdata = test_data, type = "response")
test_data$Prob_Survived <- prob_test
test_data$Pred_Survived <- ifelse(test_data$Prob_Survived >= 0.5, 1, 0)

In [11]:
confus_matrix_test <- table(test_data$Pred_Survived, test_data$Survived, dnn = c("Predicted", "Actual"))
confus_matrix_test

## Evaluate the model
### We use the confusion matrices of the training and test sets to calculate accuracy, precision, and recall.

In [15]:
cat("Accuracy of train:", (confus_matrix_train[1, 1]+confus_matrix_train[2, 2])/sum(confus_matrix_train))
cat("\nPrecision of train :", confus_matrix_train[2, 2]/(confus_matrix_train[2, 1]+confus_matrix_train[2, 2]))
cat("\nRecall of train:", confus_matrix_train[2, 2]/(confus_matrix_train[1, 2]+confus_matrix_train[2, 2])) 

Accuracy of train: 0.7835671
Precision of train : 0.7554348
Recall of train: 0.6881188

In [16]:
cat("Accuracy of test:", (confus_matrix_test[1, 1]+confus_matrix_test[2, 2])/sum(confus_matrix_test))
cat("\nPrecision of test :", confus_matrix_test[2, 2]/(confus_matrix_test[2, 1]+confus_matrix_test[2, 2]))
cat("\nRecall of test :", confus_matrix_test[2, 2]/(confus_matrix_test[1, 2]+confus_matrix_test[2, 2]))

Accuracy of test: 0.772093
Precision of test : 0.7532468
Recall of test : 0.6590909
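
The same three metrics are computed twice above, so they could be wrapped in a small helper. The function below is a sketch, not part of the original notebook: `classification_metrics` is a hypothetical name, and it assumes the layout used above (rows are predicted classes, columns are actual classes, both labelled "0" and "1"). The example vectors are invented.

```r
# Compute accuracy, precision, and recall from a 2x2 confusion matrix
# whose rows are predictions and whose columns are actual values.
classification_metrics <- function(cm) {
  tn <- cm["0", "0"]  # predicted 0, actual 0
  fn <- cm["0", "1"]  # predicted 0, actual 1
  fp <- cm["1", "0"]  # predicted 1, actual 0
  tp <- cm["1", "1"]  # predicted 1, actual 1
  c(
    accuracy  = unname((tp + tn) / sum(cm)),
    precision = unname(tp / (tp + fp)),
    recall    = unname(tp / (tp + fn))
  )
}

# Invented predictions and labels, for illustration only.
pred   <- c(1, 1, 0, 0, 1, 0)
actual <- c(1, 0, 0, 1, 1, 0)

# Fixing the factor levels keeps the table 2x2 even if one class is never predicted.
cm <- table(
  factor(pred, levels = c(0, 1)),
  factor(actual, levels = c(0, 1)),
  dnn = c("Predicted", "Actual")
)

m <- classification_metrics(cm)
print(m)
```

Indexing by the labels "0" and "1" instead of by position makes it harder to mix up rows and columns when the table layout changes.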

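## Conclusion
### Test accuracy (about 77%) is close to training accuracy (about 78%). Test precision (0.753 vs. 0.755) and recall (0.659 vs. 0.688) are also close to the training values, so the model performs similarly on data it has not seen.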