# Dummy Variable Regression
To understand how categorical predictors are used in multiple regression, we start by discussing *how* this is achieved via *dummy variables*. Once we have established the theory of dummy variables, we will be able to see how the humble $t$-test can be reinterpreted as a regression model. This, in turn, is the starting point for understanding how one-way ANOVA, factorial ANOVA and ANCOVA models can all be subsumed under this single framework.

## Dummy Variables
In order to integrate a categorical variable into a regression model, we use what is known as a *dummy variable*. This is a new variable, taking only the values 0 or 1, that we insert into the regression model in place of the category labels. As an example, let us return to the `mtcars` dataset and create a new categorical variable that indicates whether a car was manufactured in the USA or somewhere else in the world.

In [15]:
data(mtcars)
mtcars$origin <- c('Other','Other','USA','USA','USA','USA','USA','Other','Other','Other',
                   'Other','Other','Other','Other','USA','USA','USA','Other','Other',
                   'Other','Other','USA','USA','USA','USA','Other','Other','Other',
                   'Other','Other','Other','Other')

print(mtcars)

                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb origin
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4  Other
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Other
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    USA
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1    USA
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2    USA
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1    USA
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4    USA
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2  Other
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2  Other
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4  Other
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4  Other
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17

In order to turn this variable into a dummy variable, we could assign a value of 0 to the `Other` category and a value of 1 to the `USA` category. This gives us the following coding scheme:

| Category | Dummy Value |
| -------- | ----------- |
| Other    | 0           |
| USA      | 1           |

### Manual Dummy Variables

Although it is not necessary to make dummy variables manually (because `R` will do this automatically for you), it can be instructive to do so at this stage to help your understanding. In this example, we want a new variable containing the dummy values associated with the `origin` variable. For instance:

In [14]:
# Create empty variable 
n          <- length(mtcars$origin)
origin.dum <- numeric(n)

# Assign dummy values
origin.dum[mtcars$origin == 'Other'] <- 0
origin.dum[mtcars$origin == 'USA']   <- 1

# Print
print(origin.dum)

 [1] 0 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0


So what happens if we put this dummy variable into a regression model? ...
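
As a minimal sketch of one way to explore this question, the block below fits a regression with the manual dummy as the only predictor and compares the coefficients with the group means. The names `manual.mod`, `mean.other` and `mean.usa` are introduced here for illustration and are not part of the original material. The dummy values are rebuilt from the printed vector above so that the block runs on its own.

```r
# mtcars ships with R's datasets package, so it is available without loading.
# Rebuild the 0/1 dummy from the values printed above (run lengths 2,5,7,3,4,4,7).
origin.dum <- rep(c(0, 1, 0, 1, 0, 1, 0), times = c(2, 5, 7, 3, 4, 4, 7))

# Fit a regression with the dummy variable as the only predictor
manual.mod <- lm(mpg ~ origin.dum, data = mtcars)
print(coef(manual.mod))

# Compute the mean mpg within each category for comparison
mean.other <- mean(mtcars$mpg[origin.dum == 0])
mean.usa   <- mean(mtcars$mpg[origin.dum == 1])

# The intercept should match the mean of the category coded 0, and the
# slope should match the difference between the two category means
print(c(mean.other = mean.other, difference = mean.usa - mean.other))
```

If the two printed vectors agree, this suggests that the intercept describes the category coded 0 and the slope describes how far the category coded 1 sits from it.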

### Automatic Dummy Variables
To make sure `R` treats `origin` as a categorical variable, we should convert it to class `factor`. This gives us scope to rearrange the levels and do other manipulations, rather than having `R` treat it as a string variable. `R` then constructs the dummy variable for us when the factor is used as a predictor.

In [8]:
mtcars$origin <- as.factor(mtcars$origin)
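
To see what `R` does with a factor behind the scenes, the following sketch inspects the level ordering, the default contrast coding, and the resulting design matrix. The object names `origin.f` and `X` are assumptions made for this illustration, and the factor is rebuilt in a standalone variable so that `mtcars` itself is left untouched.

```r
# Standalone copy of the origin labels (same pattern as the vector defined earlier)
origin.f <- factor(rep(c('Other', 'USA', 'Other', 'USA', 'Other', 'USA', 'Other'),
                       times = c(2, 5, 7, 3, 4, 4, 7)))

# Levels are sorted alphabetically by default; the first level is the reference
print(levels(origin.f))

# Default (treatment) contrasts: the reference level is coded 0, the other level 1
print(contrasts(origin.f))

# The design matrix that lm() would build: an intercept column plus the dummy
X <- model.matrix(~ origin.f)
print(head(X))
```

The dummy column in `X` should correspond to the manual coding scheme, with `Other` as 0 and `USA` as 1.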

In [9]:
dummy.mod <- lm(mpg ~ origin, data=mtcars)
summary(dummy.mod)


Call:
lm(formula = mpg ~ origin, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-7.445 -3.595 -1.006  3.164 11.455 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   22.445      1.176  19.079  < 2e-16 ***
originUSA     -6.278      1.921  -3.268  0.00272 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.261 on 30 degrees of freedom
Multiple R-squared:  0.2625,	Adjusted R-squared:  0.238 
F-statistic: 10.68 on 1 and 30 DF,  p-value: 0.002716
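
One way to read the coefficients in this output is to write the model out explicitly. The notation below is introduced for illustration only and does not come from the original text: $y_i$ is the `mpg` of car $i$, $x_i$ is its dummy value (0 for `Other`, 1 for `USA`), $\beta_0$ and $\beta_1$ are the intercept and slope, and $\epsilon_i$ is the error term.

$$
\begin{aligned}
y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \\
E(y_i \mid x_i = 0) &= \beta_0 \\
E(y_i \mid x_i = 1) &= \beta_0 + \beta_1
\end{aligned}
$$

Under this reading, the `(Intercept)` estimate of 22.445 is the mean `mpg` of the `Other` cars, and the `originUSA` estimate of -6.278 is the difference between the two group means, which puts the mean for `USA` cars at roughly $22.445 - 6.278 = 16.167$. `R` treats `Other` as the reference category here because factor levels are ordered alphabetically by default.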


## The $t$-Test as a Regression Model

## Different Coding Schemes

Strictly speaking, dummy variables are indicator variables with a value of either 0 or 1. In practice, however, the term is often used more loosely to mean any coded variable that stands in for categories. This opens up the possibility of different coding schemes that change the interpretation of the model parameters.
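
As a hedged illustration of how a different coding changes the interpretation, the sketch below uses sum-to-zero coding via `contr.sum()`, which codes the two categories as 1 and -1 instead of 0 and 1. This scheme is chosen here only as an example and is not necessarily the one the text goes on to discuss. The names `origin.sum`, `sum.mod` and `group.means` are hypothetical.

```r
# Standalone factor with the same origin labels as before
origin.sum <- factor(rep(c('Other', 'USA', 'Other', 'USA', 'Other', 'USA', 'Other'),
                         times = c(2, 5, 7, 3, 4, 4, 7)))

# Replace the default 0/1 coding with sum-to-zero coding (Other = 1, USA = -1)
contrasts(origin.sum) <- contr.sum(2)
print(contrasts(origin.sum))

# Refit the model: the fitted values are unchanged, but the parameters are not
sum.mod <- lm(mpg ~ origin.sum, data = mtcars)
print(coef(sum.mod))

# Under this coding the intercept is the unweighted average of the two group
# means, and the slope is half the difference between them (Other minus USA)
group.means <- tapply(mtcars$mpg, origin.sum, mean)
print(c(average = mean(group.means),
        half.diff = unname(group.means['Other'] - group.means['USA']) / 2))
```

The same data and the same fitted group means are described by different parameter values, which is why the coding scheme has to be known before the coefficients can be interpreted.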