<!-- ---
title: "Week 6"
title-block-banner: true
title-block-style: default
execute:
  freeze: true
  cache: true
format: html
# format: pdf
--- -->

In [15]:

# renv::activate(dir)


#### Packages we will require this week

In [16]:
packages <- c(
    # Old packages
    "ISLR2",
    "dplyr",
    "tidyr",
    "readr",
    "purrr",
    "repr",
    "tidyverse",
    "kableExtra",
    "IRdisplay",
    "car",
    "corrplot",
    # NEW
    "torch",
    "torchvision",
    "luz",
    # Dimension reduction
    "dimRed",
    "RSpectra"
)

# renv::install(packages)
sapply(packages, require, character.only=TRUE)

Loading required package: dimRed

"there is no package called 'dimRed'"
Loading required package: RSpectra

"there is no package called 'RSpectra'"


---


### Agenda:

1. Real-world neural network classification
1. Dataloaders
1. Torch for image classification

<br><br><br>

## Titanic

In [60]:
url <- "https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv"

df <- read_csv(url) %>%
    mutate_if(\(x) is.character(x), as.factor) %>%
    mutate(y = Survived) %>%
    select(-c(Name, Survived)) %>%
    (\(x) {
        names(x) <- tolower(names(x))
        x
    })

df %>% head

[1mRows: [22m[34m887[39m [1mColumns: [22m[34m8[39m
[36m--[39m [1mColumn specification[22m [36m--------------------------------------------------------[39m
[1mDelimiter:[22m ","
[31mchr[39m (2): Name, Sex
[32mdbl[39m (6): Survived, Pclass, Age, Siblings/Spouses Aboard, Parents/Children Ab...

[36mi[39m Use `spec()` to retrieve the full column specification for this data.
[36mi[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


pclass,sex,age,siblings/spouses aboard,parents/children aboard,fare,y
<dbl>,<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
3,male,22,1,0,7.25,0
1,female,38,1,0,71.2833,1
3,female,26,0,0,7.925,1
1,female,35,1,0,53.1,1
3,male,35,0,0,8.05,0
3,male,27,0,0,8.4583,0


## Breast Cancer Prediction

In [18]:
# url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"

# col_names <- c("id", "diagnosis", paste0("feat", 1:30))

# df <- read_csv(
#         url, col_names, col_types = cols()
#     ) %>% 
#     select(-id) %>% 
#     mutate(y = ifelse(diagnosis == "M", 1, 0)) %>%
#     select(-diagnosis)


# df %>% head

### Train/Test Split

In [62]:
k <- 5
test_ind <- sample( 1 : nrow(df), floor(nrow(df)/ k ), replace = FALSE)
test_ind

In [63]:
df_train <- df[-test_ind, ]
df_test <- df[test_ind, ]

nrow(df_train) + nrow(df_test)

### Benchmark with Logistic Regression

In [64]:
fit_glm <- glm( y ~ . , df_train %>% mutate_at("y", factor), family = binomial())
glm_test <- predict(fit_glm, df_test, output = "reponse")

glm_preds <- ifelse(glm_test > 0.5, 1, 0)
table(glm_preds, df_test$y)

         
glm_preds  0  1
        0 99 34
        1  6 38

### Neural Net Model

In [65]:
nn_model <- nn_module(
    initialize = function(p, q1, q2, q3){
        self$hidden1 <- nn_linear(p, q1)
        self$hidden2 <- nn_linear(q1, q2)
        self$hidden3 <- nn_linear(q2, q3)
        self$output <- nn_linear(q3, 1)
        self$activation <- nn_relu()
        self$sigmoid <- nn_sigmoid()
    },
    forward = function(x){
        x %>% 
        self$hidden1() %>% self$activation() %>%
        self$hidden2() %>% self$activation() %>%
        self$hidden3() %>% self$activation() %>%
        self$output() %>% self$sigmoid()
    }
)

### Fit using Luz

In [66]:
M <- model.matrix(y ~ 0 + . , data = df_train)
M

Unnamed: 0,pclass,sexfemale,sexmale,age,`siblings/spouses aboard`,`parents/children aboard`,fare
1,3,0,1,22,1,0,7.2500
2,1,1,0,38,1,0,71.2833
3,3,1,0,26,0,0,7.9250
4,1,1,0,35,1,0,53.1000
5,3,0,1,35,0,0,8.0500
6,3,0,1,27,0,0,8.4583
7,1,0,1,54,0,0,51.8625
8,3,0,1,2,3,1,21.0750
9,2,1,0,14,1,0,30.0708
10,3,1,0,4,1,1,16.7000


In [68]:
fit_nn <- nn_model %>% 
    setup(loss = nn_bce_loss(),
        optimizer = optim_adam, 
        metrics = list(luz_metric_accuracy())) %>%
    set_hparams(p = ncol(M), q1 = 256, q2 = 128, q3 = 64) %>%
    set_opt_hparams(lr = 0.005) %>%
    fit(data = list(
        model.matrix(y ~ 0 + ., data = df_train), df_train %>% select(y) %>% as.matrix
    ),
        valid_data = list(
        model.matrix(y ~ 0 + ., data = df_test), df_test %>% select(y) %>% as.matrix

        ),
        epochs = 50, verbose = TRUE)

Epoch 1/50


Train metrics: Loss: 0.6777 - Acc: 12.2273
Valid metrics: Loss: 0.6129 - Acc: 12.5932
Epoch 2/50
Train metrics: Loss: 0.6186 - Acc: 12.1818
Valid metrics: Loss: 0.6267 - Acc: 12.5932
Epoch 3/50
Train metrics: Loss: 0.5956 - Acc: 12.1364
Valid metrics: Loss: 0.593 - Acc: 12.5932
Epoch 4/50
Train metrics: Loss: 0.5862 - Acc: 12.2273
Valid metrics: Loss: 0.5774 - Acc: 12.5932
Epoch 5/50
Train metrics: Loss: 0.5674 - Acc: 12.2273
Valid metrics: Loss: 0.5466 - Acc: 12.5932
Epoch 6/50
Train metrics: Loss: 0.5402 - Acc: 12.1364
Valid metrics: Loss: 0.5258 - Acc: 12.5932
Epoch 7/50
Train metrics: Loss: 0.5164 - Acc: 12.1818
Valid metrics: Loss: 0.5001 - Acc: 12.5932
Epoch 8/50
Train metrics: Loss: 0.5182 - Acc: 12.2273
Valid metrics: Loss: 0.5305 - Acc: 12.5932
Epoch 9/50
Train metrics: Loss: 0.5151 - Acc: 12.0455
Valid metrics: Loss: 0.4877 - Acc: 12.5932
Epoch 10/50
Train metrics: Loss: 0.5018 - Acc: 12.0909
Valid metrics: Loss: 0.5027 - Acc: 12.5932
Epoch 11/50
Train metrics: Loss: 0.4955 -

Luz expects the data to be input as a list. This is important for the fit part of the code above. In this
list we need to specify our X and Y. We select every variable but the response in one list and then just
the response variable in an other. Donʼt forget to use as.matrix() .


Also its very useful because it doesnʼt matter what the data set looks like as long as the response
variable is labeled y the code will work. Shown by the switch of the df from the breast cancer data set
to the titanic data set.


We can now make predictions on this model using the predict function.
If a model has 0 intercept then when the intercept is 0 then everything else is also 0. This can be useful
at times. For example if modeling horsepower to car price then it makes sense that if the horsepower is
0 then so should the price.


In this case the neural net did worse than logistic regression but the good part is that we can change
some parameters like the learning rate that will give us better results. Also these were small data sets
but if we were to do this with bigger ones then the neural nets would outperform the logistic regression
models in a more noticable way