In [2]:
library(caret)
library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.3     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v lubridate 1.9.3     v tibble    3.2.1
v purrr     1.0.2     v tidyr     1.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
x purrr::lift()   masks caret::lift()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


In [10]:
# read in the dataset
path = "./Hunger Games DS.csv"
data = read.csv(path)
# add a column to the dataframe to indicate whether they survived the first day
surv_day1 = ifelse(data$survival_days > 1, 1, 0)
# add a column to indicate if they had a name
has_name = ifelse(data$name == "unknown", 0, 1)
# change the name of the "sex" variate to "female"
colnames(data)[4] = "female"
# combine the dataframe with the new variables above
data = cbind(data, surv_day1, has_name)

In [15]:
# set the formula for the regression
formula_reg = surv_day1 ~ female + age + volunteer + has_name
# do the regression
model = glm(formula = formula_reg, data = data, family = binomial)
summary(model)


Call:
glm(formula = formula_reg, family = binomial, data = data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -21.4579    13.8648  -1.548    0.122
female         1.1860     1.7692   0.670    0.503
age            1.1814     0.8111   1.457    0.145
volunteer      4.9915     3.0503   1.636    0.102
has_name      23.8143  4145.1979   0.006    0.995

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 33.104  on 23  degrees of freedom
Residual deviance: 11.422  on 19  degrees of freedom
AIC: 21.422

Number of Fisher Scoring iterations: 19


**a)** Unusual Output

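The coefficient of *has_name* is enormous (23.81), and its standard error is larger still (4145.20), which leaves a $z$ value of only 0.006. The fit also took 19 Fisher scoring iterations, far more than a well-behaved model of this size usually needs.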

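Allison (2008) attributes this pattern to *complete/quasi-complete separation*: when a variate (or a linear combination of variates) predicts the target variable perfectly, or perfectly apart from a few tied cases, the maximum likelihood estimate does not exist, so the estimate and its standard error blow up. The second, "almost perfect" case is the quasi-complete one. As we can see from the data itself: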

In [18]:
cbind(data$surv_day1, data$has_name)

0,1
1,1
1,1
1,1
1,1
1,0
0,0
0,0
1,0
0,0
1,1


The *has_name* variate almost perfectly predicts survival of the first day on its own. This makes sense given the dataset: a character who has a name in the book is likely to survive, because their presence is essential to the plot. A character without a name is likely to be fodder for the first chapter set in the games themselves.

(Yet another reason why this book is a phony Battle Royale ripoff, but I digress)
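
The separation argument in part a) can be checked directly by cross-tabulating the suspect variate against the outcome. For a 0/1 variate, an empty cell in the 2x2 table means one level of the variate only ever occurs with one outcome, which is the classic way a dummy variable produces quasi-complete separation. A minimal sketch, assuming made-up toy vectors rather than the actual dataset; `has_empty_cell` is a hypothetical helper, not a function from any package:

```r
# Hypothetical helper: TRUE if some (predictor level, outcome) combination
# never occurs, i.e. the 2x2 table has an empty cell.
has_empty_cell <- function(predictor, outcome) {
  # Fix the levels so that missing combinations still show up as zero counts
  tab <- table(
    predictor = factor(predictor, levels = c(0, 1)),
    outcome   = factor(outcome,   levels = c(0, 1))
  )
  print(tab)
  any(tab == 0)
}

# Toy vectors invented for illustration (NOT the Hunger Games data)
toy_outcome   <- c(0, 0, 1, 1, 1, 1, 0, 1)
toy_separated <- c(0, 0, 0, 1, 1, 1, 0, 1)  # every case with a 1 has outcome 1
toy_mixed     <- c(0, 1, 0, 1, 0, 1, 1, 0)  # both outcomes occur at both levels

separated_flag <- has_empty_cell(toy_separated, toy_outcome)
mixed_flag     <- has_empty_cell(toy_mixed, toy_outcome)
```

In the notebook, the equivalent call would be `has_empty_cell(data$has_name, data$surv_day1)`. A result of `FALSE` would not rule out separation, since it can also arise from a combination of variates.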

**b)** Discussion of x-variates

Since the $p$-values are not a reliable metric when the fit fails to converge properly, we'll interpret the odds ratios ($e^{\beta_i}$) instead.
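
As a reminder of why $e^{\beta_i}$ is read as an odds ratio (the standard derivation, with $p$ standing for the probability of surviving the first day and $x_i$ for the variates in the formula), the fitted model is

$$
\log\frac{p}{1-p} = \beta_0 + \sum_i \beta_i x_i
$$

Raising $x_i$ by one unit while holding the other variates fixed adds $\beta_i$ to the log-odds, so the odds are multiplied by

$$
\frac{\text{odds}(x_i + 1)}{\text{odds}(x_i)} = \frac{e^{\beta_0 + \dots + \beta_i (x_i + 1) + \dots}}{e^{\beta_0 + \dots + \beta_i x_i + \dots}} = e^{\beta_i}
$$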

None of the fitted slope coefficients is negative, so no variate reduces the odds of surviving the first day (a negative coefficient would give an odds ratio below 1). Now, with all other variables held constant...

- $\beta_{\text{female}} = 1.1860 \implies e^{\beta_{\text{female}}} \approx 3.27$ - being female multiplies the odds of surviving the first day by roughly 3.27 compared to being male, **with all other variables held constant**
- $\beta_{\text{age}} = 1.1814 \implies e^{\beta_{\text{age}}} \approx 3.26$ - each additional year of age multiplies the odds of surviving the first day by roughly 3.26, **with all other variables held constant**
    - With the exception of Rue (who got hard carried by Katniss during the games), most of the extremely young participants died on the first day. A more mature body brings better adaptability for such a game.
- $\beta_{\text{volunteer}} = 4.9915 \implies e^{\beta_{\text{volunteer}}} \approx 147.16$ - volunteering to compete multiplies the odds of surviving the first day by roughly 147.16, **with all other variables held constant**.
    - This makes sense, as those who volunteered as tributes survived 9.625 days on average, and one of the winners of the games, Katniss, was a volunteer herself. By contrast, those who did not volunteer survived 4.875 days on average. If you volunteer to participate in a death game, you are pretty likely to do well in said death game compared to those who just get thrown in.
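
The odds ratios quoted in the list above can be reproduced by exponentiating the estimates. A small sketch, assuming the three slope estimates are retyped by hand from the `summary(model)` output rather than pulled from the fitted object:

```r
# Slope estimates copied from the summary(model) output
beta <- c(female = 1.1860, age = 1.1814, volunteer = 4.9915)

# Exponentiate the log-odds coefficients to get odds ratios
odds_ratios <- exp(beta)
print(round(odds_ratios, 2))

# In the notebook itself, exp(coef(model)) gives the same quantities
# (along with the intercept and has_name) without retyping the numbers.
```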