# Data Cleaning for Model Fitting
The data are problematic: 8 categorical variables need to be expanded into model matrices, and some of those variables have factor levels in the test data that never appear in the training data. This script cleans the training and test data into two sets of objects: one for the base run, and one for the run without X0, X2, and X5, so that the test observations with extra levels in those variables can still be fit.
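
The extra-levels problem can be checked directly before any cleaning. A minimal sketch on made-up data: `toy_train`, `toy_test`, and the helper `extra_levels` are hypothetical names used only for illustration and are not part of this notebook.

```r
# Toy stand-ins for the real data; the values are invented for illustration.
toy_train <- data.frame(X0 = c("a", "b", "c"), X1 = c("u", "v", "u"),
                        stringsAsFactors = FALSE)
toy_test  <- data.frame(X0 = c("a", "d", "b"), X1 = c("v", "u", "u"),
                        stringsAsFactors = FALSE)

# For each shared column, list the values seen in test but never in train.
extra_levels <- function(train_df, test_df) {
  cols <- intersect(colnames(train_df), colnames(test_df))
  out <- lapply(cols, function(nm) {
    setdiff(unique(test_df[[nm]]), unique(train_df[[nm]]))
  })
  names(out) <- cols
  out
}

extra <- extra_levels(toy_train, toy_test)

# Names of the columns that have at least one unseen level
names(extra)[lengths(extra) > 0]
```

Running the same helper on `cat_vars_train` and `cat_vars_test` would show which categorical variables are affected in the real data.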

In [1]:
options(stringsAsFactors = FALSE)
library(dplyr, quietly = TRUE, warn.conflicts = FALSE)

"package 'dplyr' was built under R version 3.3.2"

In [2]:
train <- read.csv('../data/train.csv')
test <- read.csv('../data/test.csv')

In [3]:
# Extract covariates
covars <- train %>% select(starts_with('X'))
cat_names <- colnames(covars)[sapply(covars, is.character)]
num_names <- colnames(covars)[!sapply(covars, is.character)]

# Categorical variables
cat_vars_train <- train[cat_names]
cat_vars_test <- test[cat_names]

# Quantitative variables
num_vars_train <- train[num_names]
num_vars_test <- test[num_names]

### Prune Categorical Variables
Create model matrices for the categorical variables, drop columns so that the training matrix is full rank once an intercept is added, and build the group indices used by the grouped lasso.

In [4]:
# Produce categorical model matrices
cat_mm_train <- as.data.frame(model.matrix(~ . - 1, data = cat_vars_train))
cat_mm_test <- as.data.frame(model.matrix(~ . - 1, data = cat_vars_test))

In [5]:
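# duplicated() is FALSE at the first column of each variable prefix (X0, X1, ...),
# so this drops the first dummy column per variable, e.g. X0a out of X0a, X0b, X0c, ...
# Note: model.matrix(~ . - 1) already omits the reference level of every factor
# after the first, so those variables lose a second level here.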
keep_cat1 <- duplicated(substr(colnames(cat_mm_train), 1, 2))
cat_mm_train <- cat_mm_train[keep_cat1]
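
To see what the two steps above do to the column set, here is a small sketch on made-up data. The data frame `toy` and the objects `mm`, `keep`, and `pruned` are hypothetical; only `model.matrix()`, `substr()`, and `duplicated()` are taken from the cells above.

```r
# Two toy categorical variables with three levels each (invented values).
toy <- data.frame(X0 = c("a", "b", "c", "a"),
                  X1 = c("u", "v", "w", "u"),
                  stringsAsFactors = FALSE)

# Without an intercept, the first factor keeps a column for every level,
# while later factors are coded with treatment contrasts (reference dropped).
mm <- as.data.frame(model.matrix(~ . - 1, data = toy))
colnames(mm)

# duplicated() flags every column except the first one per two-character prefix.
keep <- duplicated(substr(colnames(mm), 1, 2))
pruned <- mm[keep]
colnames(pruned)
```

Comparing the two `colnames()` results shows which dummy columns are absorbed into the baseline for each variable.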

In [6]:
# Create vector of integers to be used as grplasso index. (For model 1 only.)
grp_ind_1 <- as.numeric(substr(colnames(cat_mm_train), 2, 2)) + 1

### Prune Quantitative Variables
Remove binary (0/1) variables that are nearly constant, i.e. almost all zeros or almost all ones.

In [7]:
# "Too unary" = mean within 1% of all zeros (0) or all ones (1)
cut_tol <- 0.01

# Look at means (purity) of each variable
bin_purity <- unname(sapply(num_vars_train, mean))
summary(bin_purity)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000000 0.004217 0.022330 0.157700 0.195800 0.999800 

In [8]:
# Trim based on this tolerance cut.  Keeping 221 variables.
num_keep <- (bin_purity < 1 - cut_tol) & (bin_purity > cut_tol)
table(num_keep)

num_keep
FALSE  TRUE 
  147   221 

In [9]:
# Remove variables we don't want
num_vars_train <- num_vars_train[num_keep]
num_vars_test <- num_vars_test[num_keep]
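
The tolerance filter can be illustrated on a tiny example. This is a sketch on invented data: `toy_bin`, `toy_tol`, `toy_purity`, `toy_keep`, and `toy_kept` are hypothetical names, and `colMeans()` is used here in place of `sapply(..., mean)` for brevity.

```r
# Four toy binary columns (invented values).
toy_bin <- data.frame(X10 = c(0, 0, 0, 0),   # all zeros
                      X11 = c(0, 1, 1, 0),   # balanced
                      X12 = c(1, 1, 1, 1),   # all ones
                      X13 = c(0, 0, 0, 1))   # mostly zeros, mean 0.25

toy_tol <- 0.01

# The mean of a 0/1 column is the share of ones.
toy_purity <- colMeans(toy_bin)

# Keep columns whose share of ones is strictly between tol and 1 - tol.
toy_keep <- toy_purity > toy_tol & toy_purity < 1 - toy_tol
toy_kept <- toy_bin[toy_keep]
colnames(toy_kept)
```

Because the mask is computed from the training data only, applying the same mask to the test columns keeps the two data frames aligned column for column.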

### Prepare Objects for Export
The model itself is fit in notebook 02. Here, we assemble the pieces needed for model 1 (the full model, using all variables kept above) and model 2 (model 1 with X0, X2, and X5 removed, for test observations that have extra factor levels in those variables).

In [10]:
# Everything needed for model 1.
y1 <- log(train$y)
x1 <- as.matrix(cbind(1, cat_mm_train, num_vars_train))
grp_ind_1 <- c(NA, grp_ind_1, rep(NA, ncol(num_vars_train)))
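
The layout of the group index can be sketched on a few made-up column names. This assumes grplasso's convention that columns sharing a number form one group and that `NA` marks an unpenalized column; `toy_cols`, `n_num`, `toy_grp`, and `toy_index` are hypothetical names.

```r
# Dummy-column names as they might look after pruning (invented).
toy_cols <- c("X0b", "X0c", "X1w", "X3d")

# Number of numeric columns appended after the dummies (made up).
n_num <- 2

# The second character is the variable number, so X0 -> group 1, X1 -> group 2, ...
# This only works while every variable number is a single digit.
toy_grp <- as.numeric(substr(toy_cols, 2, 2)) + 1

# NA for the intercept, one group per categorical variable, NA for numeric columns.
toy_index <- c(NA, toy_grp, rep(NA, n_num))
toy_index
```

The index must have one entry per column of the design matrix, in the same order as `cbind(1, cat_mm_train, num_vars_train)`.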

In [11]:
# Everything needed for model 2.
cat_mm_train2 <- cat_mm_train %>%
    select(-starts_with('X0'),
           -starts_with('X2'),
           -starts_with('X5'))

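# Apply the same rule as keep_cat1 to the remaining variables:
# drop the first remaining dummy column of each one.
keep_cat2 <- duplicated(substr(colnames(cat_mm_train2), 1, 2))
cat_mm_train2 <- cat_mm_train2[keep_cat2]

# Design matrix and group index, laid out as for model 1:
# intercept (NA), categorical groups, then numeric columns (NA).
x2 <- as.matrix(cbind(1, cat_mm_train2, num_vars_train))
grp_ind_2 <- as.numeric(substr(colnames(cat_mm_train2), 2, 2)) + 1
grp_ind_2 <- c(NA, grp_ind_2, rep(NA, ncol(num_vars_train)))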

In [12]:
# Export everything needed for 02
save(x1, y1, grp_ind_1, file = '../intermed-rdatas/m1_objects.RData')
save(x2, grp_ind_2, file = '../intermed-rdatas/m2_objects.RData')
save(cat_mm_test, num_vars_test, file = '../intermed-rdatas/test_objects.RData')
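
For reference, `save()` stores objects under their names and `load()` restores them under those same names. A minimal round-trip sketch using a temporary file: `toy_x`, `toy_y`, `tmp`, and `restored` are hypothetical and unrelated to the exported objects.

```r
# Write two toy objects to a temporary .RData file.
tmp <- tempfile(fileext = ".RData")
toy_x <- matrix(1:4, nrow = 2)
toy_y <- c(0.5, 1.5)
save(toy_x, toy_y, file = tmp)

# Remove them, then restore from disk.
rm(toy_x, toy_y)
restored <- load(tmp)   # load() returns the names of the restored objects
restored
dim(toy_x)
```

Note that `save()` does not create missing directories, so `../intermed-rdatas/` has to exist before the export cell is run.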

### To Be Continued...
The model will finally be fit in notebook 02!