# Data Pre-Processing

## Dealing with [*Missing* Data](https://en.wikipedia.org/wiki/Missing_data)

In most online _machine learning_ classes you are taught that when your data set is **incomplete** you can either:
1. Erase the corresponding rows with missing cells; or
2. _Impute_ (fill) the sample average of each column into those missing cells.

It turns out that the second option requires stronger assumptions than the first. If the observations are **missing completely at random** and your sample is i.i.d., the first option is harmless for large data sets: deleting rows shrinks the sample but does not bias the estimates.
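
As a minimal sketch of the two options, the snippet below applies both to a small made-up data frame. The object `toy` and its values are hypothetical and serve only as an illustration; only base R functions are used.

```r
## toy data with missing cells (hypothetical values, for illustration only)
toy <- data.frame(price = c(100, 120, NA, 90, 150),
                  rooms = c(3, NA, 4, 2, 5))

## option 1: listwise deletion, keeping only the complete rows
deleted <- na.omit(toy)

## option 2: mean imputation, filling each column with its sample average
imputed <- toy
for (v in names(imputed)) {
  imputed[[v]][is.na(imputed[[v]])] <- mean(imputed[[v]], na.rm = TRUE)
}

nrow(deleted)       # number of rows left after deletion
colMeans(imputed)   # column averages after imputation

## comparing the spread of 'price' before and after imputation
var(toy$price, na.rm = TRUE)
var(imputed$price)
```

Mean imputation leaves each column average unchanged but shrinks the sample variance, which is one way to see why it leans on stronger assumptions than deletion.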

## Encoding [Categorical Variables](https://en.wikipedia.org/wiki/Categorical_variable)

💻 Consider the ```hprice3``` data set from the ```wooldridge``` package:

In [None]:
## installing the 'wooldridge' package if not previously installed, then loading it
if (!require(wooldridge)) { install.packages('wooldridge'); library(wooldridge) }

data(hprice3)

## Obs:   321

##  1. year                     1978, 1981
##  2. age                      age of house
##  3. agesq                    age^2
##  4. nbh                      neighborhood, 1 to 6
##  5. cbd                      dist. to central bus. dstrct, feet
##  6. inst                     dist. to interstate, feet
##  7. linst                    log(inst)
##  8. price                    selling price
##  9. rooms                    # rooms in house
## 10. area                     square footage of house
## 11. land                     square footage lot
## 12. baths                    # bathrooms
## 13. dist                     dist. from house to incin., feet
## 14. ldist                    log(dist)
## 15. lprice                   log(price)
## 16. y81                      =1 if year = 1981
## 17. larea                    log(area)
## 18. lland                    log(land)
## 19. linstsq                  linst^2

💻 Variables ```y81```, ```rooms```, and ```nbh``` are examples of <ins>categorical</ins> variables. In econometrics, ```y81``` is called a standard dummy variable, ```rooms``` an _ordered_ categorical variable, and ```nbh``` an _unordered_ categorical variable. Both ```rooms``` and ```nbh``` have _multiple_ categories. The function ```factor()``` with option ```ordered=TRUE``` or ```ordered=FALSE``` (default) allows us to handle them accordingly in all analyses.

In [None]:
## without using the 'factor' function
no.factor <- hprice3[, c("y81", "rooms", "nbh")]
summary(no.factor)

## using the 'factor' function
yes.factor <- data.frame(y81   = factor(hprice3$y81),
                         rooms = factor(hprice3$rooms, ordered = TRUE),
                         nbh   = factor(hprice3$nbh, ordered = FALSE))
summary(yes.factor)

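💻 The default behavior in regression is to transform each categorical variable with multiple categories into a set of $c-1$ dummy variables and include them as regressors, where $c$ represents the number of categories. (In R this dummy coding applies to unordered factors, such as those created by ```as.factor()```; ordered factors are coded by default with $c-1$ polynomial contrasts instead.)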

In [None]:
ols <- lm(lprice ~ lland + larea + I(log(cbd)) +
                   as.factor(y81) + as.factor(rooms) + as.factor(nbh) +
                   linst + linstsq + ldist + baths + age + agesq,
          data=hprice3)

## installing the 'lmtest' and 'sandwich' packages if not previously installed, then loading them
if (!require(lmtest)) { install.packages('lmtest'); library(lmtest) }
if (!require(sandwich)) { install.packages('sandwich'); library(sandwich) }

## turning 'off' scientific notation
options(scipen = 999) 

## calculating t-statistics for 'significance' with heteroskedasticity-robust (HC1) standard errors
coeftest(ols, vcov = vcovHC, type = "HC1")
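
To see the $c-1$ coding in isolation, the sketch below expands a small made-up factor with ```model.matrix()```. The object `toy` is hypothetical, and the column names shown assume R's default contrast settings.

```r
## toy factor with c = 3 categories (hypothetical data, for illustration only)
toy <- data.frame(nbh = factor(c(1, 2, 3, 2, 1, 3)))

## model.matrix() expands the factor using the default treatment contrasts
D <- model.matrix(~ nbh, data = toy)
colnames(D)   # intercept plus c - 1 = 2 dummy columns; level '1' is the base

## an ordered factor also yields c - 1 columns, coded as polynomial contrasts
toy$nbh.ord <- factor(toy$nbh, ordered = TRUE)
P <- model.matrix(~ nbh.ord, data = toy)
colnames(P)
```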

💻 In many machine learning algorithms you are required to provide the design (model) matrix, $\mathbf{X}$ (*without* an intercept), and the response vector, $\mathbf{y}$.

In [None]:
X <- model.matrix(ols)[,-1]
dim(X)
colnames(X)
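
The design matrix and the response vector can also be built directly from a formula, without fitting a model first. The following is a minimal sketch on made-up data; the object `toy` and its columns are hypothetical.

```r
## toy data (hypothetical values, for illustration only)
toy <- data.frame(y   = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8),
                  x   = c(0.5, 0.1, 1.5, 1.1, 0.9, 1.3),
                  grp = factor(c("a", "b", "c", "a", "b", "c")))

## model.frame() evaluates the formula; no model is fitted
mf <- model.frame(y ~ x + grp, data = toy)

## design matrix without the intercept column, and the response vector
X <- model.matrix(attr(mf, "terms"), data = mf)[, -1]
y <- model.response(mf)

dim(X)
colnames(X)
```

Dropping the first column after the fact keeps the $c-1$ coding, whereas removing the intercept inside the formula (with `- 1`) would make R emit a full set of dummies for the first factor.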

📌 It is good practice to define categorical variables _outside_ the model formula/fitting. When doing this, one can easily change the 'base' category using the ```relevel()``` function along with the ```within()``` function.
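
A minimal sketch of this practice on made-up data follows. The object `toy` is hypothetical, and choosing level `'3'` as the new base is only an illustration.

```r
## toy data (hypothetical values, for illustration only)
toy <- data.frame(y   = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8),
                  nbh = c(1, 2, 3, 1, 2, 3))

## define the factor outside the formula, then move the base category to '3'
toy <- within(toy, {
  nbh <- factor(nbh)
  nbh <- relevel(nbh, ref = "3")
})

levels(toy$nbh)   # the base category is listed first

## the dummies are now measured relative to neighborhood '3'
fit <- lm(y ~ nbh, data = toy)
names(coef(fit))
```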

## Including [Interaction Terms](https://en.wikipedia.org/wiki/Interaction_(statistics))

In the previously fitted model we included ```linstsq``` and ```agesq``` as predictors. These correspond to the squares of the original predictors ```linst``` and ```age```. In economics we include such predictors to account for increasing/decreasing returns to scale in modelling. Since $\texttt{linst}^2=\texttt{linst}\times\texttt{linst}$ and $\texttt{age}^2=\texttt{age}\times\texttt{age}$, one can think of each of them as a specific type of interaction term (the product of a predictor with itself).
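
Squared and interaction terms can also be written directly in the model formula rather than stored as separate columns. The sketch below uses made-up data; the object `toy` and the dummy `d` are hypothetical.

```r
## toy data (hypothetical values, for illustration only)
toy <- data.frame(y   = c(1.2, 0.7, 3.1, 2.4, 1.9, 2.8, 3.5, 0.9),
                  age = c(5, 1, 20, 12, 9, 15, 30, 2),
                  d   = c(0, 1, 0, 1, 0, 1, 0, 1))

## I() protects '^' so that it means arithmetic squaring inside a formula
sq <- lm(y ~ age + I(age^2), data = toy)

## 'age:d' adds only the product; 'age * d' expands to age + d + age:d
int <- lm(y ~ age * d, data = toy)

names(coef(sq))
names(coef(int))
```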