# Exploratory data analysis (EDA)
### Authors: Ricardo, Eusebio, Marcos, Adrián José Martínez Navarro, Javier Gamero Muñoz

# Contents
* [Introduction to the problem](#introduction-to-the-problem)
* [Recoding variables](#recoding-variables)
  * [Variable ```Cabin```](#recoding-the-cabin-variable)
  * [Creating the ```Alone``` variable](#creating-the-alone-variable)
* [Missing values](#missing-values)
  * [Categorical missing values](#categorical-missing-values)
  * [Numerical missing values](#numerical-missing-values)
* [Graphical analysis of the variables](#univariate-analysis)
  * [Univariate analysis](#univariate-analysis)
  * [Multivariate analysis](#multivariate-analysis)

# Introduction to the problem
The problem is set in the year 2912, when the spaceship 
*Titanic* collided with a spatial anomaly. As a result of the accident, 
some of the passengers were transported to another dimension. The goal of this work is to predict whether each passenger in a smaller group (**test**) was transported, using the passengers whose outcome is already known (**train**). Five classification algorithms will be used: 
* *k-nearest neighbors (kNN)* 
* Classification trees
* *Support vector machine (SVM)*
* Logistic regression
* Naïve Bayes

According to the 
[Kaggle](https://www.kaggle.com/competitions/spaceship-titanic/data)
page from which the data are taken, each instance holds the descriptive information of one passenger, defined by the following variables:

**Input variables:**
* ```PassengerId```: Identifier for each passenger in the format *gggg_pp*, 
  where *gggg* is the travel group and *pp* is the passenger's number within the group. People in the 
  same group are often family members, but not always.
* ```HomePlanet```: Planet where the passenger boarded.
* ```CryoSleep```: Indicates whether the passenger was put into suspended animation 
  for the duration of the voyage.
* ```Cabin```: Cabin where the passenger was staying, in the format 
  *deck/num/side*, where *side* is either *P* for *port* or *S* for *starboard*.
* ```Destination```: Planet where the passenger disembarks.
* ```Age```: Age of the passenger.
* ```VIP```: Indicates whether the passenger had VIP service during the voyage.
* ```RoomService```, ```FoodCourt```, ```ShoppingMall```, ```Spa```, 
  ```VRDeck```: Amount of money spent on each of the ship's services.
* ```Name```: First name and surname of the passenger.

**Output variable:**
* ```Transported```: Indicates whether the passenger was 
  transported to another dimension.

The data are therefore split into two datasets, ```train.csv``` and 
```test.csv```. The first contains the cases where the passengers' fate 
is known (whether they were transported or not), while in the 
second it is not. 

We start by examining the variables of the ```train.csv``` dataset:

In [None]:
library(tidyverse)
library(Amelia)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
Loading required package: Rcpp

## 
## Amelia II: Multiple Imputation
## (Version 1.8.0, built: 2021-05-26)
## Copyright (C) 2005-2023 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
## 



In [None]:
df = read.csv("../data/train.csv", header = TRUE)
head(df)
str(df)

The train dataset has 8693 observations and 14 variables, 
8 of which are strings while the rest are numeric. Some of the character variables (such as the Boolean ones, coded as 'True' or 'False') are really factors, as will become clear throughout this document. For now they are left coded as strings, with this in mind.

In [None]:
sum(duplicated(df))

In addition, there are no duplicated observations.

# Recoding variables
To begin with, some of the variables described above are recoded and new ones
that may prove useful are created.
## Recoding the ```Cabin``` variable 
The ```Cabin``` variable, in the format *deck/num/side*, is split into 
three separate variables, ```Cabin_deck```, ```Cabin_num``` and ```Cabin_side```, since this form can be more informative.

In [None]:
cabin_splitted = str_split(df$Cabin, '/', simplify = TRUE)
#id_splitted = str_split(df$PassengerId, '_', simplify = TRUE)

In [None]:
#colnames(id_splitted) = c('group_id','personal_id')
#head(id_splitted)

We check the distinct values these new variables take:

In [None]:
table(cabin_splitted[, 1])#Cabin deck

In [None]:
length(unique(cabin_splitted[,2]))#Cabin_num

In [None]:
table(cabin_splitted[, 3])#Cabin_side

In [None]:
df = df %>% mutate(Cabin_deck = cabin_splitted[, 1], 
                   Cabin_num = as.integer(cabin_splitted[, 2]), 
                   Cabin_side = cabin_splitted[, 3]) %>% select(-Cabin)
head(df)

Note that ```Cabin_deck``` and ```Cabin_side``` contain some empty strings. These are 
missing values and will be handled later on. 
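
The split can also be illustrated on a few toy strings. The sketch below uses base R only and made-up cabin values; unlike the cell above, it maps an empty cabin straight to ```NA``` in all three parts. The helper ```get_part``` is hypothetical, introduced only for this illustration.

```r
# Toy cabin strings; "" mimics a missing cabin in the raw CSV.
cabin <- c("B/0/P", "F/12/S", "", "G/734/S")

# Split on "/" and return NA when a piece is absent.
parts <- strsplit(cabin, "/", fixed = TRUE)
get_part <- function(p, i) if (length(p) >= i) p[i] else NA_character_

cabin_df <- data.frame(
  Cabin_deck = vapply(parts, get_part, character(1), i = 1),
  Cabin_num  = as.integer(vapply(parts, get_part, character(1), i = 2)),
  Cabin_side = vapply(parts, get_part, character(1), i = 3),
  stringsAsFactors = FALSE
)
print(cabin_df)
```

With this variant the later conversion of empty strings to ```NA``` would not be needed for the cabin columns.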

## Creating the ```Alone``` variable
The ```PassengerId``` variable, in the format *gggg_pp*, also reveals whether 
passengers were travelling alone or in a group, which may be useful to know. We create a variable that captures this:

In [None]:
df = df %>% mutate(Group = (str_split(PassengerId, '_', simplify = TRUE))[,1])
vector_group = df %>% count(Group) %>% filter(n > 1)
vector_group = vector_group$Group
df = df %>% mutate(Alone = ifelse(Group %in% vector_group, "False", "True")) %>% select(-Group)

In [None]:
head(df)

In [None]:
df %>% count(Alone)
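
The same idea can be checked by hand on a handful of identifiers. This is a minimal base R sketch with invented IDs; ```group_size``` is an illustrative intermediate that the notebook does not keep.

```r
# Invented IDs in the gggg_pp format: group 0003 has two members.
ids <- c("0001_01", "0002_01", "0003_01", "0003_02", "0004_01")

# Keep the part before the underscore as the group label.
group <- sub("_.*$", "", ids)

# Look up each passenger's group size in the frequency table.
group_size <- as.integer(table(group)[group])

# Same coding as in the notebook: "True" when the group has one member.
alone <- ifelse(group_size == 1, "True", "False")
print(data.frame(PassengerId = ids, group_size = group_size, Alone = alone))
```

Keeping the group size itself, rather than only the flag, would be an alternative feature to explore.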

# Missing values
With the variables recoded to make the data easier to understand, we move on to 
studying the missing values in the ```train.csv``` dataset.

In [None]:
missmap(df)

Missing values are visible in the numeric variables. In the string variables, on the other hand, missing values may be encoded with something other than ```NA```.

In [None]:
print(table(df$HomePlanet))
print(table(df$Destination))
print(table(df$CryoSleep))
print(table(df$VIP))

As already seen with the ```Cabin``` variable, missing values appear 
to be recorded as empty strings, so we convert them to ```NA```:

In [None]:
df$Cabin_deck[df$Cabin_deck == ""] = NA
df$Cabin_num[df$Cabin_num == ""] = NA  # no-op: as.integer("") already returned NA
df$Cabin_side[df$Cabin_side == ""] = NA
df$HomePlanet[df$HomePlanet == ""] = NA
df$Destination[df$Destination == ""] = NA
df$CryoSleep[df$CryoSleep == ""] = NA
df$VIP[df$VIP == ""] = NA

In [None]:
missmap(df)

Overall, missing values do not make up a large share of the 
dataset, so we check how many instances would be lost by dropping 
every row with at least one missing value:

In [None]:
# Fraction of incomplete rows
sum(!complete.cases(df))/nrow(df)
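
A small toy example helps to see why the share of incomplete rows can be much larger than the share of missing values in any single column. The data frame below is invented for illustration only.

```r
# Invented data: each column has exactly one NA out of four values.
toy <- data.frame(
  Age        = c(24, NA, 58, 33),
  HomePlanet = c("Earth", "Europa", NA, "Mars"),
  Spa        = c(0, 549, NA, 0),
  stringsAsFactors = FALSE
)

# Fraction of missing values per column.
na_by_col <- colMeans(is.na(toy))

# Fraction of rows with at least one missing value.
incomplete_rows <- mean(!complete.cases(toy))

print(na_by_col)
print(incomplete_rows)
```

Because the missing values are spread over different rows, dropping incomplete rows discards more data than the per-column figures suggest.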

Dropping these rows would remove $22\%$ of the data, so 
this option is discarded. Imputation is required.

We first handle the missing values of the categorical attributes and then those of the numeric attributes. 

## Categorical missing values
As a general imputation method, each ```NA``` is assigned the **mode** of its attribute:

In [None]:
mode_hp = (df %>% group_by(HomePlanet) %>% summarize(n = n()) %>% na.omit %>% top_n(1))$HomePlanet
mode_cabin_deck = (df %>% group_by(Cabin_deck) %>% summarize(n = n()) %>% na.omit %>% top_n(1))$Cabin_deck
mode_cabin_side = (df %>% group_by(Cabin_side) %>% summarize(n = n()) %>% na.omit %>% top_n(1))$Cabin_side
mode_dest = (df %>% group_by(Destination) %>% summarize(n = n()) %>% na.omit %>% top_n(1))$Destination
mode_cs = (df %>% group_by(CryoSleep) %>% summarize(n = n()) %>% na.omit %>% top_n(1))$CryoSleep
mode_vip = (df %>% group_by(VIP) %>% summarize(n = n()) %>% na.omit %>% top_n(1))$VIP


In [None]:
df_mode = data.frame(mode_hp = mode_hp, mode_cabin_deck = mode_cabin_deck, mode_cabin_side = mode_cabin_side, mode_cs = mode_cs, mode_vip = mode_vip, mode_dest = mode_dest)
df_mode
write.csv(df_mode,'../data/modes.csv', row.names = FALSE)

In [None]:
df_imputed = data.frame(df)

In [None]:
df_imputed$Cabin_deck[is.na(df_imputed$Cabin_deck)] = mode_cabin_deck
df_imputed$Cabin_side[is.na(df_imputed$Cabin_side)] = mode_cabin_side
df_imputed$HomePlanet[is.na(df_imputed$HomePlanet)] = mode_hp
df_imputed$Destination[is.na(df_imputed$Destination)] = mode_dest
df_imputed$VIP[is.na(df_imputed$VIP)] = mode_vip
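
The repeated mode computation could be wrapped in a small helper. This is a sketch in base R under stated assumptions: ```stat_mode``` and ```impute_mode``` are hypothetical names, the vector is toy data, and ties are resolved by taking the first level in alphabetical order, which may differ from what ```top_n(1)``` returns when several values tie.

```r
# Most frequent non-missing value of a vector.
stat_mode <- function(x) {
  counts <- table(x)                 # table() drops NA by default
  names(counts)[which.max(counts)]   # first maximum if there is a tie
}

# Replace NA with a given value (by default the vector's own mode).
impute_mode <- function(x, value = stat_mode(x)) {
  force(value)
  x[is.na(x)] <- value
  x
}

planet <- c("Earth", "Mars", NA, "Earth", "Europa", NA)
print(impute_mode(planet))
```

Passing ```value``` explicitly makes it possible to reuse a mode computed on train when imputing another dataset.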

The ```CryoSleep``` attribute, however, deserves special attention. Among 
the cases where all the numeric variables (the expenses) are 
zero, there are missing values as well:

In [None]:
df_imputed %>% filter(RoomService==0,FoodCourt == 0, ShoppingMall == 0, Spa == 0,
                     VRDeck == 0) %>% count(CryoSleep)

Specifically, among passengers with no spending, 87 have a missing ```CryoSleep``` value, 2690 are in suspended animation and 470 
are not. Since most passengers who spent no money during the voyage are 
in suspended animation, these 87 missing values are imputed as 
```CryoSleep = 'True'```.

In [None]:
cs_idTrue = df_imputed %>% filter(is.na(CryoSleep),RoomService==0,
                                  FoodCourt == 0, ShoppingMall == 0, Spa == 0,
                                  VRDeck == 0) %>% select(PassengerId)

df_imputed[df_imputed$PassengerId %in% cs_idTrue[,1],]$CryoSleep = 'True'

Conversely, if any of the expenses is non-zero, the passenger can be 
assumed not to be in suspended animation:

In [None]:
df_imputed %>% filter(RoomService!=0 | FoodCourt != 0 | ShoppingMall != 0 | Spa != 0 |
                     VRDeck != 0) %>% count(CryoSleep)

Indeed, apart from those with a missing value, no passenger with some spending on board is in 
suspended animation. We impute the missing ones as 
```CryoSleep = 'False'```:

In [None]:
cs_idFalse = df_imputed %>% filter(is.na(CryoSleep), (RoomService!=0 | FoodCourt != 0 | ShoppingMall != 0 | Spa != 0 |
                     VRDeck != 0)) %>% select(PassengerId) 

In [None]:
df_imputed[df_imputed$PassengerId %in% cs_idFalse[,1],]$CryoSleep = 'False'

In [None]:
df_imputed %>% filter(is.na(CryoSleep))

However, 11 missing values remain in ```CryoSleep```, in cases where 
one of the expenses is ```NA``` and the rest are zero. We check which ```CryoSleep``` values such cases usually take:

In [None]:
df_imputed %>% filter((is.na(RoomService) | RoomService == 0),
                      (is.na(FoodCourt) | FoodCourt == 0), 
                      (is.na(Spa) | Spa == 0), 
                      (is.na(ShoppingMall) | ShoppingMall == 0), 
                      (is.na(VRDeck) | VRDeck == 0)) %>%
               mutate(total_expenses = RoomService + FoodCourt + Spa + ShoppingMall + VRDeck) %>%
               filter(is.na(total_expenses)) %>% count(CryoSleep)

We impute these cases as ```CryoSleep = 'True'```, since it would be unusual for 
a passenger to spend on only one of the services and for exactly that value to be 
missing as well:

In [None]:
cs_idTrue_NAnum = df_imputed %>% filter(is.na(CryoSleep)) %>% select(PassengerId)

In [None]:
df_imputed[df_imputed$PassengerId %in% cs_idTrue_NAnum[,1],]$CryoSleep = 'True'

In [None]:
df_imputed %>% count(CryoSleep)
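
The three imputation steps above amount to one rule: a missing ```CryoSleep``` becomes ```'False'``` when any known expense is non-zero, and ```'True'``` otherwise. The sketch below applies that rule in a single pass on invented rows; it is a compact restatement for illustration, not the code used in this notebook.

```r
expense_cols <- c("RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck")

# Invented rows: all zero, one known expense, one NA among zeros, already known.
toy <- data.frame(
  CryoSleep    = c(NA, NA, NA, "False"),
  RoomService  = c(0, 120, NA, 0),
  FoodCourt    = c(0, 0, 0, 0),
  ShoppingMall = c(0, 0, 0, 0),
  Spa          = c(0, NA, 0, 0),
  VRDeck       = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# TRUE when at least one non-missing expense differs from zero.
any_spend <- rowSums(toy[expense_cols] != 0, na.rm = TRUE) > 0

# Only touch rows where CryoSleep is missing.
missing_cs <- is.na(toy$CryoSleep)
toy$CryoSleep[missing_cs] <- ifelse(any_spend[missing_cs], "False", "True")
print(toy$CryoSleep)
```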

## Numerical missing values
We now turn to the numeric missing values.

In [None]:
missmap(df_imputed)

We start with the cases where one of the expenses is missing, the 
rest are zero, and the passenger is in suspended animation:

In [None]:
df_imputed %>% filter(CryoSleep == "True", (is.na(RoomService) | is.na(VRDeck) | is.na(Spa) | is.na(ShoppingMall) | is.na(FoodCourt))) %>%
                  mutate(total_expenses = RoomService + FoodCourt + Spa + ShoppingMall + VRDeck) %>%
                  filter(is.na(total_expenses)) %>% head()

As mentioned earlier, passengers in suspended animation cannot 
have incurred any expenses, so these missing values must be zero:

In [None]:
expenses_IdZero_RoomService = df_imputed %>% filter(CryoSleep == "True", is.na(RoomService)) %>% select(PassengerId)
expenses_IdZero_VRDeck = df_imputed %>% filter(CryoSleep == "True", is.na(VRDeck)) %>% select(PassengerId)
expenses_IdZero_Spa = df_imputed %>% filter(CryoSleep == "True", is.na(Spa)) %>% select(PassengerId)
expenses_IdZero_ShoppingMall = df_imputed %>% filter(CryoSleep == "True", is.na(ShoppingMall)) %>% select(PassengerId)
expenses_IdZero_FoodCourt = df_imputed %>% filter(CryoSleep == "True", is.na(FoodCourt)) %>% select(PassengerId)

In [None]:
df_imputed[df_imputed$PassengerId %in% expenses_IdZero_RoomService[,1],]$RoomService = 0
df_imputed[df_imputed$PassengerId %in% expenses_IdZero_VRDeck[,1],]$VRDeck = 0
df_imputed[df_imputed$PassengerId %in% expenses_IdZero_Spa[,1],]$Spa = 0
df_imputed[df_imputed$PassengerId %in% expenses_IdZero_ShoppingMall[,1],]$ShoppingMall = 0
df_imputed[df_imputed$PassengerId %in% expenses_IdZero_FoodCourt[,1],]$FoodCourt = 0
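
The five near-identical assignments can also be expressed as a loop over the expense columns. A minimal sketch on invented rows, assuming ```CryoSleep``` has already been fully imputed:

```r
expense_cols <- c("RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck")

# Invented rows: two passengers in cryosleep, one awake.
toy <- data.frame(
  CryoSleep    = c("True", "True", "False"),
  RoomService  = c(NA, 0, NA),
  FoodCourt    = c(0, NA, 35),
  ShoppingMall = c(0, 0, 0),
  Spa          = c(0, 0, NA),
  VRDeck       = c(NA, 0, 12),
  stringsAsFactors = FALSE
)

asleep <- toy$CryoSleep == "True"
for (col in expense_cols) {
  # Fill only the missing expenses of passengers in cryosleep.
  fill <- asleep & is.na(toy[[col]])
  toy[[col]][fill] <- 0
}
print(toy)
```

The missing expenses of the awake passenger are left untouched here, since they are handled separately.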

In [None]:
missmap(df_imputed)

We now check the cases where ```CryoSleep = 'False'``` and some 
expense value is missing:

In [None]:
df_imputed %>% filter(CryoSleep == "False", (is.na(RoomService) | is.na(VRDeck) |
                      is.na(Spa) | is.na(ShoppingMall) | is.na(FoodCourt))) %>% 
                      head()

For these observations, given the distributions of the variables shown in the next section, the most suitable choice is to impute
the ```NA``` values with the median of each attribute:

In [None]:
expenses_IdMedian_RoomService = df_imputed %>% filter(CryoSleep == "False", is.na(RoomService)) %>% select(PassengerId)
expenses_IdMedian_VRDeck = df_imputed %>% filter(CryoSleep == "False", is.na(VRDeck)) %>% select(PassengerId)
expenses_IdMedian_Spa = df_imputed %>% filter(CryoSleep == "False", is.na(Spa)) %>% select(PassengerId)
expenses_IdMedian_ShoppingMall = df_imputed %>% filter(CryoSleep == "False", is.na(ShoppingMall)) %>% select(PassengerId)
expenses_IdMedian_FoodCourt = df_imputed %>% filter(CryoSleep == "False", is.na(FoodCourt)) %>% select(PassengerId)

In [None]:
vd_median = median(df_imputed$VRDeck, na.rm = TRUE)
spa_median = median(df_imputed$Spa, na.rm = TRUE)
sm_median = median(df_imputed$ShoppingMall, na.rm = TRUE)
fc_median = median(df_imputed$FoodCourt, na.rm = TRUE)
rs_median = median(df_imputed$RoomService, na.rm = TRUE)
age_median = median(df_imputed$Age, na.rm = TRUE)

df_imputed[df_imputed$PassengerId %in% expenses_IdMedian_RoomService[,1],]$RoomService = rs_median
df_imputed[df_imputed$PassengerId %in% expenses_IdMedian_VRDeck[,1],]$VRDeck = vd_median
df_imputed[df_imputed$PassengerId %in% expenses_IdMedian_Spa[,1],]$Spa = spa_median
df_imputed[df_imputed$PassengerId %in% expenses_IdMedian_ShoppingMall[,1],]$ShoppingMall = sm_median
df_imputed[df_imputed$PassengerId %in% expenses_IdMedian_FoodCourt[,1],]$FoodCourt = fc_median
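
The preference for the median over the mean can be seen on a small right-skewed vector. The values below are invented and only mimic the shape of an expense variable (many zeros, a few large amounts).

```r
# Invented, right-skewed expense values with one missing entry.
spa <- c(0, 0, 15, 300, 2500, NA)

spa_median <- median(spa, na.rm = TRUE)
spa_mean   <- mean(spa, na.rm = TRUE)

# Impute the missing entry with the median.
spa_imputed <- replace(spa, is.na(spa), spa_median)

print(c(median = spa_median, mean = spa_mean))
print(spa_imputed)
```

A single large value pulls the mean far above what a typical passenger spends, whereas the median stays close to the bulk of the data.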

In [None]:
df_median = data.frame(rs_median = rs_median, vd_median = vd_median, spa_median = spa_median, sm_median = sm_median, fc_median = fc_median, age_median = age_median)
df_median
write.csv(df_median,'../data/median.csv', row.names = FALSE)

In [None]:
missmap(df_imputed)

Age, again given the distribution of its values, is also imputed with the median.

In [None]:
df_imputed[is.na(df_imputed$Age), ]$Age = age_median

In [None]:
missmap(df_imputed)

Only the ```Cabin_num``` variable remains. However, this variable seems too specific to each instance and does not appear to be correlated with the output variable ```Transported```, so it is left as it is. 

## Univariate outliers

In [None]:
df_imputed.IQR <- df_imputed %>% select(-Cabin_num) %>% select_if(is.numeric) %>% apply(2, IQR)
df_imputed.Quartiles <- df_imputed %>% select(-Cabin_num) %>% select_if(is.numeric) %>% apply(2, quantile,c(0.25,0.75))
Upper.limit <- df_imputed.Quartiles[2,]+1.5*df_imputed.IQR
Upper.limit

In [None]:
df_imputed %>% filter(RoomService<Upper.limit['RoomService'], FoodCourt<Upper.limit['FoodCourt'],
                      ShoppingMall<Upper.limit['ShoppingMall'], Spa<Upper.limit['Spa'], 
                      VRDeck<Upper.limit['VRDeck']) %>% nrow()
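
The upper limit computed above is the usual rule of the third quartile plus 1.5 times the interquartile range. A worked sketch on an invented vector:

```r
# Invented expense values: mostly small, one very large.
x <- c(0, 0, 0, 10, 20, 40, 5000)

q     <- quantile(x, c(0.25, 0.75))   # default (type 7) quantiles
upper <- q[[2]] + 1.5 * IQR(x)        # Q3 + 1.5 * IQR

print(upper)
print(x[x > upper])                   # values flagged as outliers
```

One caveat worth keeping in mind: when more than three quarters of a column are zero, both the third quartile and the interquartile range are zero, so this rule flags every positive value.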

## Univariate analysis

In [None]:
head(df_imputed)

In [None]:
ggplot(df_imputed, aes(x = Transported)) +
    geom_bar()

In [None]:
ggplot(df_imputed, aes(x = HomePlanet)) +
    geom_bar()

In [None]:
ggplot(df_imputed, aes(x = CryoSleep)) +
    geom_bar()

In [None]:
ggplot(df_imputed, aes(x = VIP)) +
    geom_bar()

In [None]:
ggplot(df_imputed, aes(x = Destination)) +
    geom_bar()

In [None]:
ggplot(df_imputed, aes(x = Cabin_deck)) +
    geom_bar()

In [None]:
ggplot(df_imputed, aes(x = Cabin_side)) +
    geom_bar()

In [None]:
df_imputed %>% pivot_longer(cols = c("RoomService", "Spa", "FoodCourt", "VRDeck", "ShoppingMall")) %>%
               ggplot(., aes(x = value)) +
                geom_density() +
                facet_wrap(~name, scale = "free")

In [None]:
ggplot(df_imputed, aes(x = Age)) +
    geom_density()

## Multivariate analysis

In [None]:
df_imputed %>% filter(RoomService > 10000 | Spa > 10000 | 
                      FoodCourt > 10000 | VRDeck > 10000 |
                      ShoppingMall > 10000 ) %>% count(Cabin_deck)
                    

In [None]:
ggplot(df_imputed, aes(x = CryoSleep, fill = CryoSleep)) +
    geom_bar() +
    facet_wrap(~Transported)

The plot shows that passengers who were not transported were most likely not in CryoSleep. The converse association does not hold, however, although it would have been useful for classification.

In [None]:
ggplot(df_imputed, aes(x = HomePlanet, fill=Transported)) +
    geom_bar(color='black', alpha=0.5, position='dodge')

library(vcd)
mosaic(~ HomePlanet + Transported, data = df_imputed,shade=T)

For Mars there is no clear dependence on the planet of origin. For Europa and Earth, however, there is an imbalance between transported and non-transported passengers.
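
A visual impression of this kind can be complemented with a chi-squared test of independence. The sketch below uses a made-up 2x2 table, not counts from the dataset, and the row labels ```A``` and ```B``` are placeholders. To the best of our knowledge, the shading of ```mosaic(..., shade = TRUE)``` is based on Pearson residuals, which are the quantities in ```res$residuals```.

```r
# Made-up contingency table (rows: origin, columns: outcome).
tab <- matrix(c(30, 10, 10, 30), nrow = 2,
              dimnames = list(Origin = c("A", "B"),
                              Transported = c("False", "True")))

# Pearson chi-squared test without continuity correction.
res <- chisq.test(tab, correct = FALSE)

print(res$statistic)
print(res$p.value)
print(round(res$residuals, 2))   # positive: more cases than expected
```

Applied to the real data, the table would come from ```table(df_imputed$HomePlanet, df_imputed$Transported)```.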

In [None]:
ggplot(df_imputed, aes(x = Transported, fill = Transported)) +
    geom_bar() +
    facet_wrap(~VIP)

Being a VIP passenger does not appear to discriminate between transported and non-transported passengers.

In [None]:
ggplot(df_imputed, aes(x = Destination, fill=Transported)) +
    geom_bar(color='black', alpha=0.5, position='dodge')

In [None]:
ggplot(df_imputed, aes(x = Cabin_deck, fill=Transported)) +
    geom_bar(color='black', alpha=0.5, position='dodge')

In [None]:
ggplot(df_imputed, aes(x = Alone, fill=Transported)) +
    geom_bar(color='black', alpha=0.5, position='dodge')

mosaic(~ Transported + Alone, data = df_imputed,shade=T)

In [None]:
ggplot(df_imputed, aes(x = Age, fill=Transported)) +
    geom_density(alpha=0.4)# +
    #facet_wrap(~Transported)

Children under 10 appear to be transported more often than not.

In [None]:
ggplot(df_imputed %>% filter(RoomService<3000), aes(x = RoomService, fill=Transported)) +
    geom_density(alpha=0.4)

In [None]:
ggplot(df_imputed, aes(y = RoomService, fill = Transported)) +
    geom_boxplot() +
    facet_wrap(~Transported)

In [None]:
df_imputed %>% filter(RoomService > 5e3)

Passengers with ```Transported``` equal to True have ```RoomService``` values more concentrated at 0.

In [None]:
ggplot(df_imputed, aes(x = FoodCourt, fill = Transported)) +
    geom_boxplot() +
    coord_flip() +
    facet_wrap(~Transported)

In [None]:
df_imputed %>% filter(FoodCourt > 20e3)
df_imputed %>% filter(FoodCourt > 5e3) %>% group_by(HomePlanet) %>% 
    summarise(n = n())

In [None]:
ggplot(df_imputed, aes(x = Spa, fill = Transported)) +
    geom_boxplot() +
    coord_flip() +
    facet_wrap(~Transported)

In [None]:
df_imputed %>% filter(Spa > 15e3)
df_imputed %>% filter(Spa > 10e3) %>% group_by(HomePlanet) %>% 
    summarise(n = n())

In [None]:
ggplot(df_imputed, aes(x = VRDeck, fill = Transported)) +
    geom_boxplot() +
    coord_flip() +
    facet_wrap(~Transported)

In [None]:
df_imputed %>% filter(VRDeck > 15e3)
df_imputed %>% filter(VRDeck > 10e3) %>% group_by(HomePlanet) %>% 
    summarise(n = n())

In [None]:
ggplot(df_imputed, aes(x = ShoppingMall, fill = Transported)) +
    geom_boxplot() +
    coord_flip() +
    facet_wrap(~Transported)

In [None]:
df_imputed %>% filter(ShoppingMall > 10e3)
df_imputed %>% filter(ShoppingMall > 10e3) %>% group_by(HomePlanet) %>% 
    summarise(n = n())

In [None]:
ggplot(df_imputed %>% mutate(total_expenses = RoomService + FoodCourt + Spa + ShoppingMall + VRDeck), aes(x = total_expenses, fill = VIP)) +
    geom_boxplot() +
    coord_flip() +
    facet_wrap(~VIP)

In [None]:
df_imputed %>% mutate(total_expenses = RoomService + FoodCourt + Spa + ShoppingMall + VRDeck) %>% 
    filter(total_expenses > 20e3)
df_imputed %>% mutate(total_expenses = RoomService + FoodCourt + Spa + ShoppingMall + VRDeck) %>% 
    filter(VRDeck > 10e3) %>% group_by(HomePlanet) %>% 
    summarise(n = n())

In [None]:
ggplot(df_imputed, aes(x = Destination, fill = Transported)) +
    geom_bar(position = 'dodge') +
    facet_wrap(~CryoSleep)


In [None]:
ggplot(df_imputed, aes(x = Cabin_deck, fill = CryoSleep)) +
    geom_bar(position = 'dodge')

Note deck ```F```.

In [None]:
ggplot(df_imputed, aes(x = CryoSleep, fill = Transported)) +
    geom_bar(position = 'dodge')+
    facet_wrap(~HomePlanet)

Passengers in CryoSleep who come from Europa or Mars are most likely transported. For those coming from Earth the pattern is less clear.

In [None]:
ggplot(df_imputed , aes(x = CryoSleep, fill = Transported)) +
    geom_bar(position = 'dodge')+
    facet_wrap(~Destination)

Most passengers heading to Cancri with CryoSleep equal to True are transported.

In [None]:
ggplot(df_imputed, aes(x = Destination, fill = Transported)) +
    geom_bar(position = 'dodge')+
    facet_wrap(~HomePlanet)

For Earth, the passengers with Transported equal to False are mostly those whose Destination is Trappist-1e.

In [None]:
ggplot(df_imputed, aes(x = CryoSleep, fill = Transported)) +
    geom_bar(position = 'dodge')+
    facet_wrap(~Cabin_deck)

In [None]:
ggplot(df_imputed, aes(x = Cabin_deck, fill = CryoSleep)) +
    geom_bar(position = 'dodge') +
    facet_wrap(~Transported)

In [None]:
ggplot(df_imputed, aes(x = Cabin_deck, fill = Transported)) +
    geom_bar(position = 'dodge') +
    facet_wrap(~CryoSleep)

In [None]:
ggplot(df_imputed, aes(x = HomePlanet, fill = Cabin_deck)) +
    geom_bar(position = 'dodge') 

CryoSleep = True --> passengers certainly in their cabins --> decks A, B, C, D, F more transported  
Different deck distribution by planet of origin --> deck G, Earth passengers, fewer 
transported --> the plots in the cell below show this difference.

In [None]:
ggplot(df_imputed, aes(x = HomePlanet, fill=Transported)) +
    geom_bar(color='black', alpha=0.5, position='dodge')

Note the CryoSleep = True cases --> certainly in their cabin --> more transported?

In [None]:
df_imputed %>% filter(VIP == 'True') %>% 
    ggplot( aes(x = VIP, fill=Transported)) +
    geom_bar(color='black', alpha=0.5, position='dodge') +
    scale_y_continuous(breaks = seq(0,5000,200))

In [None]:
df_imputed %>% 
    ggplot(aes(x = VIP, fill=HomePlanet)) +
    geom_bar(color='black', alpha=0.5, position='dodge') +
    scale_y_continuous(breaks = seq(0,5000,200)) 

In [None]:
df_imputed %>% count(VIP, Transported)

In [None]:
df_imputed %>% ggplot(aes(x = Alone, fill = Cabin_deck)) +
    geom_bar(position = 'dodge') 

In [None]:
df_imputed %>% 
    ggplot(aes(x = Age, fill = CryoSleep)) +
    geom_density(alpha = 0.4, position = 'dodge') +
    facet_wrap(~Cabin_deck)

In [None]:
ggplot(df_imputed,aes(x = RoomService, fill=Transported))+
        geom_density()

In [None]:
ggplot(df_imputed, aes(x = VRDeck, y = FoodCourt, color = Transported)) +
    geom_point(alpha = 0.4)

In [None]:
ggplot(df_imputed, aes(x = ShoppingMall, y = FoodCourt, color = Transported)) +
    geom_point(alpha = 0.4)

In [None]:
ggplot(df_imputed, aes(x = Spa, y = FoodCourt, color = Transported)) +
    geom_point(alpha = 0.4)

In [None]:
ggplot(df_imputed %>% filter(Spa < 6000, FoodCourt < 10000), aes(x = Spa, y = FoodCourt, color = Transported)) +
    geom_point(alpha = 0.4)

In [None]:
ggplot(df_imputed, aes(x = Spa, y = VRDeck, color = Transported)) +
    geom_point(alpha = 0.4)

In [None]:
ggplot(df_imputed, aes(x = RoomService, y = FoodCourt, color = Transported)) +
    geom_point(alpha = 0.4)

In [None]:
ggplot(df_imputed, aes(x = RoomService, y = ShoppingMall, color = Transported)) +
    geom_point(alpha = 0.4)

In [None]:
ggplot(df_imputed %>% filter(RoomService < 4000, ShoppingMall < 4000), aes(x = RoomService, y = ShoppingMall, color = Transported)) +
    geom_point(alpha = 0.4)

In [None]:
df_imputed %>% mutate(SM_FC = ShoppingMall + FoodCourt) %>%
    ggplot(., aes(x = SM_FC, y = Spa, color = Transported)) +
    geom_point()

In [None]:
df_imputed %>% mutate(SM_FC = ShoppingMall + FoodCourt) %>%
    ggplot(., aes(x = SM_FC, y = RoomService, color = Transported)) +
    geom_point()

In [None]:
df_imputed %>% mutate(SM_FC = ShoppingMall + FoodCourt) %>%
    ggplot(., aes(x = SM_FC, y = VRDeck, color = Transported)) +
    geom_point()

In [None]:
df_imputed %>% mutate(SM_FC = ShoppingMall + FoodCourt, VD_SP = Spa + VRDeck) %>%
    ggplot(., aes(x = SM_FC, y = VD_SP, color = Transported)) +
    geom_point(alpha = 0.5)

In [None]:
df_imputed %>% mutate(SM_FC = ShoppingMall + FoodCourt, VD_SP = Spa + VRDeck) %>%
    ggplot(., aes(x = Age, y = SM_FC, color = Transported)) +
    geom_point(alpha = 0.5)

In [None]:
df_imputed %>% mutate(SM_FC = ShoppingMall + FoodCourt, VD_SP = Spa + VRDeck) %>%
    ggplot(., aes(x = Age, y = VD_SP, color = Transported)) +
    geom_point(alpha = 0.5)

# Correlation

In [None]:
library(corrplot)

In [None]:
df_imputed %>% select_if(is.numeric) %>% select(-Cabin_num) %>%
     cor(method = 'spearman') %>% corrplot(method = 'number')

Nothing relevant stands out in the correlation matrix.
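
As a reminder of what ```method = 'spearman'``` measures, the sketch below shows on invented vectors that the Spearman coefficient is the Pearson correlation of the ranks, which makes it less sensitive to the extreme values seen in the expense variables.

```r
# Invented vectors: y increases with x, but not linearly.
x <- c(0, 5, 20, 400, 9000)
y <- c(1, 2, 3, 4, 5)

rho_spearman <- cor(x, y, method = "spearman")
rho_ranks    <- cor(rank(x), rank(y))   # Pearson on the ranks
rho_pearson  <- cor(x, y)

print(c(spearman = rho_spearman, ranks = rho_ranks, pearson = rho_pearson))
```

The relation is perfectly monotone, so the Spearman coefficient equals 1, while the Pearson coefficient is lower because of the large final value.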

In [None]:
str(df_imputed)

In [None]:
df_imputed = df_imputed %>% select(-Cabin_num, -Name, -Alone)
write.csv(df_imputed,'../data/train_pr.csv', row.names = FALSE)

### Conclusions

* CryoSleep is very important and conditions the numeric variables.
* Passengers who were in CryoSleep and were transported appear to be located mostly on decks A, B, C and D. In addition, those coming from Europa or Mars are most likely transported.
* High values of FoodCourt and ShoppingMall appear to indicate a tendency to be transported. It could be very useful to introduce a new variable equal to the sum of the two, thereby reducing the dimensionality of the data. VRDeck and Spa, on the other hand, appear to behave in the opposite way.
* RoomService does not provide much information a priori.
* VIP does not appear to be especially relevant.
* Alone does not appear to provide much information either.
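
The combined variables suggested in the conclusions could be built as follows. This is a sketch on invented rows that reuses the names ```SM_FC``` and ```VD_SP``` from the exploratory plots; whether the original columns should then be dropped is left open.

```r
# Invented expense rows for three passengers.
toy <- data.frame(
  ShoppingMall = c(0, 25, 371),
  FoodCourt    = c(0, 9, 3576),
  Spa          = c(0, 549, 6715),
  VRDeck       = c(0, 44, 49)
)

# Sum the pairs of services that appear to behave alike.
toy$SM_FC <- toy$ShoppingMall + toy$FoodCourt
toy$VD_SP <- toy$Spa + toy$VRDeck

print(toy[c("SM_FC", "VD_SP")])
```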

## Missing values in the test set

In [None]:
# Load the data
df_test = read.csv("../data/test.csv", header = TRUE)
head(df_test)

In [None]:
missmap(df_test)

### Categorical variables

In [None]:
cabin_splitted_test = str_split(df_test$Cabin, '/', simplify = TRUE)

In [None]:
table(cabin_splitted_test[, 1])

In [None]:
table(cabin_splitted_test[, 3])

In [None]:
df_test_imputed = df_test %>% mutate(Cabin_deck = cabin_splitted_test[, 1], 
                   Cabin_num = as.integer(cabin_splitted_test[, 2]), 
                   Cabin_side = cabin_splitted_test[, 3]) %>% select(-Cabin)

In [None]:
df_test_imputed$Cabin_deck[df_test_imputed$Cabin_deck == ""] = NA
df_test_imputed$Cabin_num[df_test_imputed$Cabin_num == ""] = NA  # no-op: as.integer("") already returned NA
df_test_imputed$Cabin_side[df_test_imputed$Cabin_side == ""] = NA
df_test_imputed$HomePlanet[df_test_imputed$HomePlanet == ""] = NA
df_test_imputed$Destination[df_test_imputed$Destination == ""] = NA
df_test_imputed$CryoSleep[df_test_imputed$CryoSleep == ""] = NA
df_test_imputed$VIP[df_test_imputed$VIP == ""] = NA

In [None]:
missmap(df_test_imputed)

In [None]:
df_test_imputed$Cabin_deck[is.na(df_test_imputed$Cabin_deck)] = mode_cabin_deck
df_test_imputed$Cabin_side[is.na(df_test_imputed$Cabin_side)] = mode_cabin_side
df_test_imputed$HomePlanet[is.na(df_test_imputed$HomePlanet)] = mode_hp
df_test_imputed$Destination[is.na(df_test_imputed$Destination)] = mode_dest
df_test_imputed$VIP[is.na(df_test_imputed$VIP)] = mode_vip
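
The test set is imputed with the modes computed on train, so no statistic is estimated from the test data. Since those modes were also written to ```../data/modes.csv```, a later session could reload them instead of recomputing them. The sketch below illustrates the round trip with a temporary file and placeholder values; the modes shown are illustrative, not the ones obtained above.

```r
# Temporary file standing in for ../data/modes.csv.
modes_path <- tempfile(fileext = ".csv")

# Placeholder modes (illustrative values only).
train_modes <- data.frame(mode_hp = "Earth", mode_dest = "TRAPPIST-1e",
                          stringsAsFactors = FALSE)
write.csv(train_modes, modes_path, row.names = FALSE)

# Later: reload the stored modes and apply them to new data.
modes <- read.csv(modes_path, stringsAsFactors = FALSE)
test_toy <- data.frame(HomePlanet  = c("Mars", NA),
                       Destination = c(NA, "55 Cancri e"),
                       stringsAsFactors = FALSE)
test_toy$HomePlanet[is.na(test_toy$HomePlanet)]   <- modes$mode_hp
test_toy$Destination[is.na(test_toy$Destination)] <- modes$mode_dest
print(test_toy)
```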

In [None]:
missmap(df_test_imputed)

In [None]:
df_test_imputed %>% count(CryoSleep)

In [None]:
df_test_imputed %>% filter(RoomService==0,FoodCourt == 0, ShoppingMall == 0, Spa == 0,
                     VRDeck == 0) %>% count(CryoSleep)

In [None]:
cs_idTrue_test = df_test_imputed %>% filter(is.na(CryoSleep),RoomService==0,
                                  FoodCourt == 0, ShoppingMall == 0, Spa == 0,
                                  VRDeck == 0) %>% select(PassengerId)

df_test_imputed[df_test_imputed$PassengerId %in% cs_idTrue_test[,1],]$CryoSleep = 'True'

In [None]:
df_test_imputed %>% filter(RoomService!=0 | FoodCourt != 0 | ShoppingMall != 0 | Spa != 0 |
                     VRDeck != 0) %>% count(CryoSleep)

In [None]:
cs_idFalse_test = df_test_imputed %>% filter(is.na(CryoSleep), (RoomService!=0 | FoodCourt != 0 | ShoppingMall != 0 | Spa != 0 |
                     VRDeck != 0)) %>% select(PassengerId) 

In [None]:
df_test_imputed[df_test_imputed$PassengerId %in% cs_idFalse_test[,1],]$CryoSleep = 'False'

In [None]:
df_test_imputed %>% count(CryoSleep)

In [None]:
df_test_imputed %>% filter(is.na(CryoSleep))

In [None]:
cs_idTrue_NAnum_test = df_test_imputed %>% filter(is.na(CryoSleep)) %>% select(PassengerId)

In [None]:
df_test_imputed[df_test_imputed$PassengerId %in% cs_idTrue_NAnum_test[,1],]$CryoSleep = 'True'

In [None]:
df_test_imputed %>% count(CryoSleep)

In [None]:
missmap(df_test_imputed)

### Numerical variables

In [None]:
df_test_imputed$Age[is.na(df_test_imputed$Age)] = age_median

In [None]:
missmap(df_test_imputed)

In [None]:
df_test_imputed %>% filter(CryoSleep == "True", (is.na(RoomService) | is.na(VRDeck) | is.na(Spa) | is.na(ShoppingMall) | is.na(FoodCourt))) %>%
                  mutate(total_expenses = RoomService + FoodCourt + Spa + ShoppingMall + VRDeck) %>%
                  filter(is.na(total_expenses)) %>% head()

In [None]:
expenses_IdZero_RoomService_test = df_test_imputed %>% filter(CryoSleep == "True", is.na(RoomService)) %>% select(PassengerId)
expenses_IdZero_VRDeck_test = df_test_imputed %>% filter(CryoSleep == "True", is.na(VRDeck)) %>% select(PassengerId)
expenses_IdZero_Spa_test = df_test_imputed %>% filter(CryoSleep == "True", is.na(Spa)) %>% select(PassengerId)
expenses_IdZero_ShoppingMall_test = df_test_imputed %>% filter(CryoSleep == "True", is.na(ShoppingMall)) %>% select(PassengerId)
expenses_IdZero_FoodCourt_test = df_test_imputed %>% filter(CryoSleep == "True", is.na(FoodCourt)) %>% select(PassengerId)

df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdZero_RoomService_test[,1],]$RoomService = 0
df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdZero_VRDeck_test[,1],]$VRDeck = 0
df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdZero_Spa_test[,1],]$Spa = 0
df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdZero_ShoppingMall_test[,1],]$ShoppingMall = 0
df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdZero_FoodCourt_test[,1],]$FoodCourt = 0

In [None]:
df_test_imputed %>% filter(CryoSleep == "False", (is.na(RoomService) | is.na(VRDeck) |
                      is.na(Spa) | is.na(ShoppingMall) | is.na(FoodCourt))) %>% 
                      head()

In [None]:
expenses_IdMedian_RoomService_test = df_test_imputed %>% filter(CryoSleep == "False", is.na(RoomService)) %>% select(PassengerId)
expenses_IdMedian_VRDeck_test = df_test_imputed %>% filter(CryoSleep == "False", is.na(VRDeck)) %>% select(PassengerId)
expenses_IdMedian_Spa_test = df_test_imputed %>% filter(CryoSleep == "False", is.na(Spa)) %>% select(PassengerId)
expenses_IdMedian_ShoppingMall_test = df_test_imputed %>% filter(CryoSleep == "False", is.na(ShoppingMall)) %>% select(PassengerId)
expenses_IdMedian_FoodCourt_test = df_test_imputed %>% filter(CryoSleep == "False", is.na(FoodCourt)) %>% select(PassengerId)

In [None]:
df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdMedian_RoomService_test[,1],]$RoomService = rs_median
df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdMedian_VRDeck_test[,1],]$VRDeck = vd_median
df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdMedian_Spa_test[,1],]$Spa = spa_median
df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdMedian_ShoppingMall_test[,1],]$ShoppingMall = sm_median
df_test_imputed[df_test_imputed$PassengerId %in% expenses_IdMedian_FoodCourt_test[,1],]$FoodCourt = fc_median

In [None]:
missmap(df_test_imputed)

In [None]:
df_test_imputed = df_test_imputed %>% select(-Cabin_num, -Name)
write.csv(df_test_imputed,'../data/test_pr.csv', row.names = FALSE)