#### 6.4.2.2 Penalization or regularization

In addition to restricting the number of parameters, we can also restrict the range of the parameters. This can be done by placing penalties on the magnitudes of the parameters. Two popular choices are the $\ell_1$-norm and $\ell_2$-norm on the parameters. 

- $\ell_1$-norm (lasso): $P(\beta_1,\ldots,\beta_k) = |\beta_1| + \ldots + |\beta_k|.$

- $\ell_2$-norm (ridge, Tikhonov): $P(\beta_1,\ldots,\beta_k) = \beta_1^2 + \ldots + \beta_k^2.$
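As a small illustration, the two penalties can be computed directly for a coefficient vector. The values in `beta` below are made up purely for this example.

```r
# Hypothetical slope coefficients (the intercept is not part of the penalty)
beta <- c(0.5, -2, 0)

l1_penalty <- sum(abs(beta))  # |0.5| + |-2| + |0|
l2_penalty <- sum(beta^2)     # 0.5^2 + (-2)^2 + 0^2

print(l1_penalty)  # 2.5
print(l2_penalty)  # 4.25
```

Note that the squared penalty weighs the large coefficient much more heavily than the small one, while the absolute-value penalty grows linearly in each coefficient.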

We then choose $\beta_0,\beta_1,\ldots,\beta_k$ to minimize
\[
L\left(y, \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\right) + \lambda P\left(\beta_1,\ldots,\beta_k\right),
\]
where $\lambda \ge 0$ modulates the degree of penalization. Note that the intercept $\beta_0$ is not penalized. Increasing $\lambda$ shrinks the coefficients more strongly and therefore results in a less complex model.
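To make the objective concrete, here is a minimal sketch for the logistic case, assuming $L$ is the binomial negative log-likelihood summed over observations. The helper `penalized_loss` and the toy data are hypothetical and exist only for illustration. glmnet scales its objective differently (it averages the loss over observations and halves the ridge term), so the $\lambda$ values here are not directly comparable to glmnet's.

```r
# Penalized logistic loss: negative log-likelihood + lambda * P(beta).
# `penalized_loss` is an illustrative helper, not a function from any package.
penalized_loss <- function(beta0, beta, X, y, lambda, penalty = c("l1", "l2")) {
  penalty <- match.arg(penalty)
  eta <- beta0 + as.vector(X %*% beta)     # linear predictor
  nll <- sum(log(1 + exp(eta)) - y * eta)  # binomial negative log-likelihood
  P <- if (penalty == "l1") sum(abs(beta)) else sum(beta^2)
  nll + lambda * P
}

# Toy data (made up): four observations, two predictors
X <- matrix(c(1, 2, 3, 4,
              0, 1, 0, 1), ncol = 2)
y <- c(0, 0, 1, 1)
beta <- c(0.8, -0.3)

loss_unpenalized <- penalized_loss(-2, beta, X, y, lambda = 0, penalty = "l1")
loss_lasso <- penalized_loss(-2, beta, X, y, lambda = 1, penalty = "l1")
loss_ridge <- penalized_loss(-2, beta, X, y, lambda = 1, penalty = "l2")

# The penalty adds lambda * P(beta) on top of the unpenalized loss
print(loss_lasso - loss_unpenalized)  # 1.1  = |0.8| + |-0.3|
print(loss_ridge - loss_unpenalized)  # 0.73 = 0.8^2 + 0.3^2
```

Because the added term grows with the size of the coefficients, a larger `lambda` pushes the minimizer toward smaller coefficients.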

In [1]:
library(tidyverse)
# Load the full data set and display it
hotel <- read_csv("../Data/hotel_bookings.csv") %>% print
# Consider a smaller set of variables:
hotel <- hotel %>%
  select(is_canceled, adr, lead_time, total_of_special_requests,
         stays_in_week_nights, stays_in_weekend_nights, previous_cancellations)
# Recode the response as a logical
hotel <- hotel %>% mutate(is_canceled = (is_canceled == 1))
# Keep rows with adr below 1000, then the first 1000 rows
hotel <- hotel %>% filter(adr < 1000)
hotel <- hotel[1:1000, ]
hotel %>% print


-- Column specification ------------------------------------------------------------
cols(
  .default = col_double(),
  hotel = col_character(),
  arrival_date_month = col_character()

# A tibble: 119,390 x 32
   hotel is_canceled lead_time arrival_date_ye~ arrival_date_mo~
   <chr>       <dbl>     <dbl>            <dbl> <chr>
 1 Reso~           0       342             2015 July
 2 Reso~           0       737             2015 July
 3 Reso~           0         7             2015 July
 4 Reso~           0        13             2015 July
 5 Reso~           0        14             2015 July
 6 Reso~           0        14             2015 July
 7 Reso~           0         0             2015 July
 8 Reso~           0         9             2015 July
 9 Reso~           1        85             2015 J

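In [4]:
# Unpenalized counterpart: glm(, family='binomial')

library(glmnet)
# Lasso-penalized logistic regression (alpha = 1 is the glmnet default).
# The first column of hotel is the response, so it is dropped from the predictors.
mod.can.lg.lasso <- glmnet(as.matrix(hotel[, -1]), hotel$is_canceled, family = 'binomial')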

In [8]:
# Number of lambda values on the fitted lasso path
length(mod.can.lg.lasso$lambda)

In [7]:
mod.can.lg.lasso$beta

   [[ suppressing 48 column names 's0', 's1', 's2' ... ]]



6 x 48 sparse Matrix of class "dgCMatrix"
                                                                            
adr                       . 0.0007123483 0.001362282 0.001956001 0.002498846
lead_time                 . .            .           .           .          
total_of_special_requests . .            .           .           .          
stays_in_week_nights      . .            .           .           .          
stays_in_weekend_nights   . .            .           .           .          
previous_cancellations    . .            .           .           .          
                                                                         
adr                       0.00299548 0.003450015 0.003866117 4.251768e-03
lead_time                 .          .           .           1.918994e-05
total_of_special_requests .          .           .           .           
stays_in_week_nights      .          .           .           .           
stays_in_weekend_nights   .          .           
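The pattern in the output above, where predictors enter the model one at a time as we move along the columns, can be reproduced on simulated data. The sketch below uses made-up data (not the hotel data), in which only the first two of six predictors affect the outcome, and counts the nonzero slopes at each $\lambda$. glmnet stores its $\lambda$ sequence in decreasing order, so the first column corresponds to the strongest penalty.

```r
library(glmnet)

# Simulated data: only the first two predictors matter (an assumption of this sketch)
set.seed(1)
n <- 200; p <- 6
X <- matrix(rnorm(n * p), n, p)
prob <- plogis(1.5 * X[, 1] - 1 * X[, 2])
y <- rbinom(n, 1, prob)

fit_lasso <- glmnet(X, y, family = "binomial")  # alpha = 1 by default (lasso)

# Count the nonzero slope coefficients at each lambda on the path
n_nonzero <- colSums(as.matrix(fit_lasso$beta) != 0)
head(data.frame(lambda = fit_lasso$lambda, n_nonzero = n_nonzero))
```

At the largest $\lambda$ every slope is exactly zero, and predictors are added as $\lambda$ decreases, which is the sense in which a larger $\lambda$ gives a less complex model.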

In [10]:
# Ridge-penalized logistic regression: alpha = 0 selects the l2 penalty
mod.can.lg.ridge <- glmnet(as.matrix(hotel[, -1]), hotel$is_canceled, family = 'binomial', alpha = 0)

In [12]:
round(mod.can.lg.ridge$lambda,4)

In [13]:
mod.can.lg.ridge$beta

   [[ suppressing 100 column names 's0', 's1', 's2' ... ]]



6 x 100 sparse Matrix of class "dgCMatrix"
                                                                   
adr                        1.768138e-39  2.391314e-05  2.623702e-05
lead_time                  5.330966e-40  7.218461e-06  7.921179e-06
total_of_special_requests -3.085461e-38 -4.176319e-04 -4.582632e-04
stays_in_week_nights       7.925425e-39  1.070199e-04  1.173891e-04
stays_in_weekend_nights   -1.100074e-38 -1.496050e-04 -1.642361e-04
previous_cancellations     .             .             .           
                                                                   
adr                        2.878672e-05  3.158331e-05  3.465051e-05
lead_time                  8.692096e-06  9.537893e-06  1.046581e-05
total_of_special_requests -5.028415e-04 -5.517455e-04 -6.053925e-04
stays_in_week_nights       1.287742e-04  1.412570e-04  1.549422e-04
stays_in_weekend_nights   -1.803029e-04 -1.979470e-04 -2.173244e-04
previous_cancellations     .             .             .           
     
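To compare the two penalties at a single value of $\lambda$, coefficients can be extracted with `coef(fit, s = ...)`. The sketch below again uses simulated data rather than the hotel data, and the value `s = 0.05` is an arbitrary choice for illustration. Keep in mind that the same $\lambda$ does not imply the same amount of shrinkage under the two penalties.

```r
library(glmnet)

# Simulated data: only x1 and x2 affect the outcome (an assumption of this sketch)
set.seed(2)
n <- 200; p <- 6
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("x", 1:p)
y <- rbinom(n, 1, plogis(1.5 * X[, 1] - X[, 2]))

fit_lasso <- glmnet(X, y, family = "binomial", alpha = 1)
fit_ridge <- glmnet(X, y, family = "binomial", alpha = 0)

# Coefficients (intercept first) at an arbitrary illustrative lambda
s <- 0.05
coefs <- cbind(lasso = as.matrix(coef(fit_lasso, s = s))[, 1],
               ridge = as.matrix(coef(fit_ridge, s = s))[, 1])
print(round(coefs, 4))
```

Ridge shrinks the slopes toward zero without setting them exactly to zero, whereas the lasso can remove predictors from the model entirely.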