In [None]:
library(tidyverse)
library(dplyr)
library(Hmisc)
library(FactoMineR)
library(sqldf)
library(caTools)
library(randomForest)
options(scipen = 999)
library(caret)
library(cowplot)
library(stats)
library(xgboost)
library(pROC)
library(parallel)
library(yardstick)
library(ggplot2)

**Reading in dataset**

In [None]:
test <- read.csv("../input/tabular-playground-series-apr-2021/test.csv")
train <- read.csv("../input/tabular-playground-series-apr-2021/train.csv")

# Subsetting training set into training and validation

**Train EDA**

In [None]:
summary(train)

**Imputing missing values with the mean and mode**

In [None]:
train <- train %>% mutate(Age = impute(Age, mean))
train <- train %>% mutate(Embarked = replace(Embarked, Embarked == "", "S"))
train <- na.omit(train)  
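
The imputation step can be illustrated on a toy data frame using base R only. This is a minimal sketch, not the notebook's exact method: the values are made up, `most_common()` is a hypothetical helper, and the mode is computed from the data, whereas the cell above hard-codes "S".

```r
# Toy data standing in for the competition file (values are made up)
toy <- data.frame(
  Age      = c(22, NA, 35, 40, NA),
  Embarked = c("S", "", "C", "S", "S"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent non-empty value of a character vector
most_common <- function(x) {
  counts <- table(x[!is.na(x) & x != ""])
  names(counts)[which.max(counts)]
}

# Mean imputation for the numeric column
toy$Age[is.na(toy$Age)] <- mean(toy$Age, na.rm = TRUE)

# Mode imputation for the categorical column
toy$Embarked[toy$Embarked == ""] <- most_common(toy$Embarked)

toy
```

Computing the mode rather than hard-coding it keeps the step valid if the data changes, at the cost of one small helper.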

**Distribution of outcome variable**

In [None]:
ggplot(train, aes(x=Survived)) + geom_histogram()


The survival rate was about 40%.

**Survival rate by other variables**

In [None]:
passengerclassandfare <- train %>%
group_by(Pclass) %>%
summarise(Averagefare = mean(Fare),
         Farestandarddeviation = sd(Fare))

passengerclassandfare

On average, the fare was highest in class 1, followed by 2 then 3

In [None]:
survivalbypassengerclass <- train %>%
group_by(Pclass,Survived) %>%
summarise(count = n())

survivalbypassengerclass

Classes 1 and 2 had fewer passengers than class 3, and more passengers survived in classes 1 and 2 than in class 3. Classes 1 and 2 are premium classes, hence the higher fares and fewer passengers.

In [None]:
survivalbygender <- train %>%
group_by(Survived,Sex) %>%
summarise(count = n())

survivalbygender 

There were more males than females overall, but survival was higher among females than males; females were likely given priority in the rescue efforts.

In [None]:
survivalbyembarkment <- train %>%
group_by(Embarked,Survived) %>%
summarise(count = n())

survivalbyembarkment

72% of passengers embarked at S, and 68% of them died, compared with 41% of those who embarked at Q and 25% of those who embarked at C. Over 50% of passengers were in class 3 and 72% of all passengers embarked at S, which also had the highest death rate. This may be attributable to rescue efforts prioritizing premium passengers in the higher classes, who also tended to embark at C and Q.
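
Percentages like these can be derived from the grouped counts with `prop.table()`. The sketch below uses a made-up toy data frame, so its numbers are illustrative and do not reproduce the figures quoted above.

```r
# Toy data (made up) with the same two columns used in the summary above
toy <- data.frame(
  Embarked = c("S", "S", "S", "S", "C", "C", "Q", "Q"),
  Survived = c(0, 0, 0, 1, 1, 1, 0, 1)
)

# Share of all passengers per port of embarkation
port_share <- prop.table(table(toy$Embarked))

# Cross-tabulate port against outcome, then convert each row to proportions:
# within each port, column "0" is the share who died, "1" the share who survived
counts <- table(toy$Embarked, toy$Survived)
rate_by_port <- prop.table(counts, margin = 1)

round(port_share, 2)
round(rate_by_port, 2)
```

Using `margin = 1` normalises within each port, which is what a statement such as "68% of them died" refers to.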

In [None]:
survivalbyage <- train %>%
group_by(Survived) %>%
summarise(averageage = mean(Age),
          agestandarddeviation = sd(Age))

survivalbyage

On average, those who survived were about 4 years older than those who died; older people were likely given priority in the rescue efforts.

**Train feature engineering**

In [None]:
# Encode travel-group membership directly as 0/1 indicators
train <- train %>% mutate(Lonetraveller = as.integer(SibSp == 0 & Parch == 0),
                          Travelwithparentsandchildren = as.integer(SibSp == 0 & Parch > 0),
                          Travelwithsiblingsandspouse = as.integer(SibSp > 0 & Parch == 0),
                          Travelwithsiblingsspouseandchildren = as.integer(SibSp > 0 & Parch > 0)
                         )

This establishes who travelled alone and who travelled with family. These relationships are likely to have a bearing on the survival rate.
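
A quick sanity check on these indicators is that every passenger falls into exactly one of the four groups. The sketch below runs that check on a made-up toy data frame in base R; the column names follow the notebook, the data does not.

```r
# Toy data (made up): one passenger per travel group
toy <- data.frame(SibSp = c(0, 0, 1, 2), Parch = c(0, 2, 0, 1))

toy$Lonetraveller                       <- as.integer(toy$SibSp == 0 & toy$Parch == 0)
toy$Travelwithparentsandchildren        <- as.integer(toy$SibSp == 0 & toy$Parch > 0)
toy$Travelwithsiblingsandspouse         <- as.integer(toy$SibSp > 0 & toy$Parch == 0)
toy$Travelwithsiblingsspouseandchildren <- as.integer(toy$SibSp > 0 & toy$Parch > 0)

# The four conditions partition the data, so each row should sum to 1
flags <- toy[, c("Lonetraveller",
                 "Travelwithparentsandchildren",
                 "Travelwithsiblingsandspouse",
                 "Travelwithsiblingsspouseandchildren")]
rowSums(flags)
```

Because the four indicators always sum to 1, they are perfectly collinear with a model intercept; `glm()` typically reports one of them with an `NA` coefficient.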

**Dropping unneeded fields after feature engineering**

In [None]:
train <-subset(train,select = -c(SibSp,Parch,Ticket,Cabin))
summary(train)

**EDA on feature engineered variables**

In [None]:
lonetravellerssurvival <- train %>%
group_by(Survived) %>%
filter(Lonetraveller == 1) %>%
summarise(count = n())

lonetravellerssurvival

About 60% of the passengers were lone travellers, and more than half of them died.

In [None]:
Parentsandchildrensurvival <- train %>%
group_by(Survived) %>%
filter(Travelwithparentsandchildren == 1) %>%
summarise(count = n())

Parentsandchildrensurvival

About 10% of the passengers were travelling with parents or children only (no siblings or spouse); about 60% of them survived.

In [None]:
siblingsandspousesurvival <- train %>%
group_by(Survived) %>%
filter(Travelwithsiblingsandspouse == 1) %>%
summarise(count = n())

siblingsandspousesurvival

About 10% of passengers were travelling with siblings or a spouse only (no parents or children); about 70% of them died.

In [None]:
siblingsspouseandchildrensurvival <- train %>%
group_by(Survived) %>%
filter(Travelwithsiblingsspouseandchildren == 1) %>%
summarise(count = n())
siblingsspouseandchildrensurvival

About 15% of passengers were travelling with both siblings/spouse and parents/children; about 70% of them died.

**Creating validation set from training set**

In [None]:
train <-subset(train,select = -c(Name,Fare))

In [None]:
smp_size <- floor(0.75 * nrow(train))
## set the seed to make your partition reproducible
set.seed(123)
trainsplit <- sample(seq_len(nrow(train)), size = smp_size)
train1 <- train[trainsplit, ]
validation <- train[-trainsplit, ]
validationy <-subset(validation,select = c(PassengerId,Survived))
validationx <-subset(validation,select = -c(Survived))

**Modelling and predicting train1**

**Logistic regression (GLM) modelling**

In [None]:
model1 <- glm(Survived ~.,data = train1,family = "binomial")
pred1 = predict(model1,validationx,type = "response")
pred1df <- data.frame('PassengerId' = validation$PassengerId, 'Predicted survived' = pred1,'Actual survived'=validationy)
pred1df$Predictedsurvivedbin <- ifelse(pred1df$Predicted.survived>0.5,1,0)


**Displaying confusion matrix from logistic regression**

In [None]:
pred1df$Actual.survived.Survived = as.factor(pred1df$Actual.survived.Survived)
pred1df$Predictedsurvivedbin = as.factor(pred1df$Predictedsurvivedbin)


cm <- conf_mat(pred1df, Actual.survived.Survived, Predictedsurvivedbin)
autoplot(cm, type = "heatmap") +
  scale_fill_gradient(low="#D6EAF8",high = "#2E86C1")


**Confusion matrix summary**


* Model accuracy (all correct / all) is 76.4%
* Misclassification rate (all incorrect / all) is 23.6%
* Precision (true positives / predicted positives) is 81%
* Sensitivity/true positive rate (true positives / all actual positives) is 76.5%
* Specificity (true negatives / all actual negatives) is 76.2%
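
The definitions in this list can be written as a small base R function. The sketch below uses hypothetical counts, not the notebook's confusion matrix, and `classification_metrics()` is an illustrative helper name. Note that yardstick treats the first factor level as the event by default, so which class counts as "positive" depends on the factor ordering.

```r
# Hypothetical helper: metrics from the four cells of a 2x2 confusion matrix
classification_metrics <- function(tp, fp, tn, fn) {
  total <- tp + fp + tn + fn
  c(
    accuracy          = (tp + tn) / total,  # all correct / all
    misclassification = (fp + fn) / total,  # all incorrect / all
    precision         = tp / (tp + fp),     # true positives / predicted positives
    sensitivity       = tp / (tp + fn),     # true positives / actual positives
    specificity       = tn / (tn + fp)      # true negatives / actual negatives
  )
}

# Made-up counts for illustration only
m <- classification_metrics(tp = 40, fp = 10, tn = 35, fn = 15)
round(m, 3)
```

Accuracy and misclassification rate always sum to 1, which gives a quick consistency check on reported figures.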


**ROC logistic regression**

In [None]:

 roc_lm<- roc(pred1df$Actual.survived.Survived, pred1df$Predicted.survived )
 plot(roc_lm, print.auc=TRUE)

**Random forest modelling**

In [None]:
train1$Survived = as.factor(train1$Survived)

model2 <- randomForest(Survived ~., data = train1,ntree = 500, importance = TRUE)
pred2 <- predict(model2, validationx, type = "prob")
pred2df <- data.frame('PassengerId' = validationx$PassengerId, 'Predicted survived' = pred2," Actual survived" =validationy)


**ROC Random Forest**

In [None]:
 roc_randomforest<- roc(pred2df$X.Actual.survived.Survived, pred2df$Predicted.survived.1)
 plot(roc_randomforest, print.auc=TRUE)


**XGBoost modelling**

In [None]:
train1x <- subset(train1, select = -c(Survived))
# Survived was converted to a factor above; xgboost needs a numeric 0/1 label
train1y <- as.numeric(as.character(train1$Survived))

train1x$Sex <- ifelse(train1x$Sex == "female", 1, 0)
train1x$Pclass <- as.numeric(train1x$Pclass)
train1x$Age <- as.numeric(train1x$Age)
train1x$PassengerId <- as.numeric(train1x$PassengerId)
# as.numeric() on a character column returns NA, so encode the ports through a
# factor with fixed levels; this gives the same codes in training and validation
train1x$Embarked <- as.numeric(factor(train1x$Embarked, levels = c("C", "Q", "S")))

train1x <- as.matrix(train1x)

model3 <- xgboost(data = train1x,
                  nrounds = 20,
                  max.depth = 3,
                  label = train1y,
                  early_stopping_rounds = 7,
                  eval_metric = "auc",
                  objective = "binary:logistic")

In [None]:
# Apply the same encoding to the validation features
validationxy <- validationx
validationxy$Sex <- ifelse(validationx$Sex == "female", 1, 0)
validationxy$Pclass <- as.numeric(validationx$Pclass)
validationxy$PassengerId <- as.numeric(validationx$PassengerId)
validationxy$Age <- as.numeric(validationx$Age)
validationxy$Embarked <- as.numeric(factor(validationx$Embarked, levels = c("C", "Q", "S")))

validationxy <- as.matrix(validationxy)
pred3 <- predict(model3, validationxy)
pred3df <- data.frame('PassengerId'= validation$PassengerId, 'Predicted survived'= pred3,' Actual survived'= validationy)

**ROC XGBoost**

In [None]:
 roc_xgboost<- roc(pred3df$X.Actual.survived.Survived, pred3df$Predicted.survived)
 plot(roc_xgboost, print.auc=TRUE)

The logistic regression and random forest models gave the best results on the validation set.
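
One way to make such a comparison explicit is to compute each model's AUC on the same validation labels and put the values side by side. The notebook uses `pROC::roc()` for this; the sketch below is a base R alternative using the rank (Mann-Whitney) formulation, with a hypothetical helper `auc_rank()` and made-up scores rather than the notebook's predictions.

```r
# Hypothetical helper: AUC as the probability that a random positive
# is scored higher than a random negative
auc_rank <- function(actual, score) {
  r <- rank(score)                 # average ranks, so ties count as 0.5
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up labels and scores for two imaginary models
actual  <- c(0, 0, 1, 1, 0, 1)
score_a <- c(0.1, 0.4, 0.35, 0.8, 0.2, 0.9)
score_b <- c(0.5, 0.6, 0.4, 0.7, 0.3, 0.2)

aucs <- c(model_a = auc_rank(actual, score_a),
          model_b = auc_rank(actual, score_b))
round(aucs, 3)
```

Collecting the AUCs in one named vector makes the model ranking visible without reading values off separate plots.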

# Full data sets training and testing

**Test data EDA**

In [None]:
summary(test)

In [None]:
test <- test %>% mutate(Age = impute(Age, mean),
                    Fare = impute(Fare, mean))
test <- test %>% mutate(Embarked = replace(Embarked, Embarked == "", "S"))
test <- na.omit(test)   

# Encode travel-group membership directly as 0/1 indicators, as for the training set
test <- test %>% mutate(Lonetraveller = as.integer(SibSp == 0 & Parch == 0),
                          Travelwithparentsandchildren = as.integer(SibSp == 0 & Parch > 0),
                          Travelwithsiblingsandspouse = as.integer(SibSp > 0 & Parch == 0),
                          Travelwithsiblingsspouseandchildren = as.integer(SibSp > 0 & Parch > 0)
                         )

**Dropping unneeded columns**

In [None]:
test <-subset(test,select = -c(SibSp,Parch,Ticket,Cabin,Name,Fare))

**Feature engineered variables distributions in test set**

In [None]:
Lonetravellerscount <- test %>%
filter(Lonetraveller == 1) %>%
summarise(count = n())
Lonetravellerscount

Lone travellers made up about 55% of the passengers in the test set.

In [None]:
Parentstravellingwithchildren <- test %>%
filter(Travelwithparentsandchildren == 1) %>%
summarise(count = n())

Parentstravellingwithchildren

8% of passengers were travelling with parents or children only.

In [None]:
siblingstravellingwithspouses <- test %>%
filter(Travelwithsiblingsandspouse == 1) %>%
summarise(count = n())

siblingstravellingwithspouses

17% of passengers were travelling with siblings or a spouse only.

In [None]:
siblingsspouseandchildrentravel <- test %>%
filter(Travelwithsiblingsspouseandchildren == 1) %>%
summarise(count = n())
siblingsspouseandchildrentravel

21% of passengers were travelling as extended families.

**Final modelling**

**Logistic regression modelling**

In [None]:
model4 <- glm(Survived ~.,data = train,family = "binomial")
pred4 = predict(model4,test,type = "response")
pred4df <- data.frame('PassengerId' = test$PassengerId, 'Predicted survived' = pred4)


**Generating submission file from logistic regression**

In [None]:

# row.names = FALSE keeps the row index out of the file
write.csv(pred4df, file = "submission.csv", row.names = FALSE)

**Random forest modelling**

In [None]:
train$Survived = as.factor(train$Survived)
# Fit on the full training set (train), not the 75% split (train1)
model5 <- randomForest(Survived ~., data = train,ntree = 500, importance = TRUE)
pred5 <- predict(model5, test, type = "prob")
pred5df <- data.frame('PassengerId' = test$PassengerId, 'Predicted survived' = pred5)


**Generating submission file from random forest**

In [None]:
# row.names = FALSE keeps the row index out of the file
write.csv(pred5df, file = "submissionrf.csv", row.names = FALSE)