## Design Matrices from Data and Model

Here we show how a design matrix can be constructed from the data in a DataFrame and a model given as a string.

### Data

In [6]:
using DataFrames, SparseArrays, LinearAlgebra
animal= ["animal1","animal2","animal3","animal4","animal5","animal6"]
sex   = ["m","f","f","m","f","f"]
breed = ["Angus","Angus","Hereford","Hereford","Angus","Angus"]
age   = [40,36,38,42,40,36]
df    = DataFrame(animal=animal,sex=sex,breed=breed,age=age,y=round.(randn(6),digits=3))

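| Row | animal  | sex    | breed    | age   | y       |
|-----|---------|--------|----------|-------|---------|
|     | String  | String | String   | Int64 | Float64 |
| 1   | animal1 | m      | Angus    | 40    | -1.15   |
| 2   | animal2 | f      | Angus    | 36    | -0.549  |
| 3   | animal3 | f      | Hereford | 38    | 1.172   |
| 4   | animal4 | m      | Hereford | 42    | -0.993  |
| 5   | animal5 | f      | Angus    | 40    | -0.287  |
| 6   | animal6 | f      | Angus    | 36    | 1.85    |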


### Model

$$
y_{ij} = \mu + sex_i + e_{ij}.
$$

We have seen previously how to construct the design matrix for a one-way model when the levels had sequential integer values. In this DataFrame the levels of sex are `m` and `f`. Below, we assign sequential integers to the strings "m" and "f".

#### Get the levels of sex from the DataFrame into a vector `A`

In [2]:
A=df[:,:sex]

6-element Array{String,1}:
 "m"
 "f"
 "f"
 "m"
 "f"
 "f"

The `unique` function returns the unique levels of a vector:

In [3]:
res = unique(A)

2-element Array{String,1}:
 "m"
 "f"

Now we can make a dictionary where each unique level is a key and the associated value is its sequential integer:

In [4]:
dictA = Dict()                 # declare empty dictionary
for (i,s) in enumerate(res)    # fill the dictionary with the values in res
    dictA[s] = i
end
dictA

Dict{Any,Any} with 2 entries:
  "f" => 2
  "m" => 1

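The loop above leaves the dictionary typed as `Dict{Any,Any}`. As a minimal sketch (the names `sexLevels` and `dictTyped` are illustrative and not part of the original notebook), the same mapping can be built in one expression with concrete key and value types:

```julia
A = ["m", "f", "f", "m", "f", "f"]     # same values as df[:, :sex]
sexLevels = unique(A)                   # levels in order of first appearance
# Build level => column-number pairs from a generator; the element types are inferred
dictTyped = Dict(s => i for (i, s) in enumerate(sexLevels))
```
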
We can use this dictionary to make the design matrix:

In [5]:
ii = 1:size(A,1)            # row numbers
jj = [dictA[i] for i in A]  # column numbers
[ii jj]

6×2 Array{Int64,2}:
 1  1
 2  2
 3  2
 4  1
 5  2
 6  2

In [6]:
XA   = sparse(ii,jj,1.0)

6×2 SparseMatrixCSC{Float64,Int64} with 6 stored entries:
  [1, 1]  =  1.0
  [4, 1]  =  1.0
  [2, 2]  =  1.0
  [3, 2]  =  1.0
  [5, 2]  =  1.0
  [6, 2]  =  1.0

In [7]:
n = size(A,1)
Matrix([ones(n,1) XA])

6×3 Array{Float64,2}:
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  1.0
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  1.0
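
Two properties of this matrix are worth checking numerically. Each row of `XA` contains exactly one 1, so the sex columns sum to the intercept column and the full matrix has rank 2 rather than 3. The sketch below rebuilds `XA` from a plain vector so that it runs on its own; `col`, `rowSums`, `levelCounts`, `X` and `r` are illustrative names.

```julia
using SparseArrays, LinearAlgebra

A   = ["m", "f", "f", "m", "f", "f"]          # same values as df[:, :sex]
col = Dict("m" => 1, "f" => 2)                # level => column number
XA  = sparse(1:length(A), [col[a] for a in A], 1.0)

rowSums     = Vector(vec(sum(XA, dims=2)))    # one indicator per observation
levelCounts = Vector(vec(sum(XA, dims=1)))    # number of records per level

X = [ones(length(A)) Matrix(XA)]              # intercept plus sex columns
r = rank(X; rtol=1e-8)                        # explicit tolerance guards against round-off
```

Here `r` is 2 because the intercept column is the sum of the `m` and `f` columns.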

In [7]:
# This function returns a dictionary with the unique values in the vector "a" as the keys and their
# sequential numbers as the associated values.
# It also returns a vector with the keys in sequential order.
function mkDict(a)
  aUnique = unique(a)
  d = Dict()
  names = Array{String}(undef,size(aUnique,1))
  for (i,s) in enumerate(aUnique)
    names[i] = s
    d[s] = i
  end
  return d,names
end

mkDict (generic function with 1 method)
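
`mkDict` stores the level names in an `Array{String}`, so it assumes the levels are strings. A possible variant, sketched here under the hypothetical name `mkDictTyped`, takes the element type from its argument and would also accept, for example, integer-coded levels:

```julia
# Hypothetical variant of mkDict: the key type follows the input vector
function mkDictTyped(a::AbstractVector{T}) where {T}
    lvls = unique(a)                                        # levels in order of first appearance
    d = Dict{T,Int}(s => i for (i, s) in enumerate(lvls))   # level => sequential number
    return d, lvls
end

dInt, lvlsInt = mkDictTyped([3, 1, 3, 2])
dStr, lvlsStr = mkDictTyped(["m", "f", "f", "m"])
```

The rest of this section keeps using `mkDict`, which is sufficient for the string-valued factors in `df`.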

### Model
$$
y_{ij} = \mu + sex_i + breed_j + \beta age_{ij} + e_{ij}
$$

We will use the `mkDict` function to construct the design matrix for this model given the data in DataFrame `df`.

In [9]:
dictA,namesA   = mkDict(A)

(Dict{Any,Any}("f"=>2,"m"=>1), ["m", "f"])

In [10]:
namesA

2-element Array{String,1}:
 "m"
 "f"

In [11]:
ii = 1:size(A,1)
jj = [dictA[i] for i in A]  #list comprehension 
[ii jj]
XA = sparse(ii,jj,1.0)

6×2 SparseMatrixCSC{Float64,Int64} with 6 stored entries:
  [1, 1]  =  1.0
  [4, 1]  =  1.0
  [2, 2]  =  1.0
  [3, 2]  =  1.0
  [5, 2]  =  1.0
  [6, 2]  =  1.0

In [12]:
B = df[:,:breed]
dictB,namesB   = mkDict(B)
jj   = [dictB[i] for i in B]  #list comprehension 
ii   = 1:size(B,1)
XB   = sparse(ii,jj,1.0)

6×2 SparseMatrixCSC{Float64,Int64} with 6 stored entries:
  [1, 1]  =  1.0
  [2, 1]  =  1.0
  [5, 1]  =  1.0
  [6, 1]  =  1.0
  [3, 2]  =  1.0
  [4, 2]  =  1.0

In [13]:
dictB

Dict{Any,Any} with 2 entries:
  "Angus"    => 1
  "Hereford" => 2

In [14]:
CVal = df[:,:age]
CStr = fill("age",size(CVal,1))  # only one column in design matrix for age

6-element Array{String,1}:
 "age"
 "age"
 "age"
 "age"
 "age"
 "age"

In [15]:
dictC,namesC   = mkDict(CStr)
jj   = [dictC[i] for i in CStr]  #list comprehension 
ii   = 1:size(CStr,1)
XC   = sparse(ii,jj,CVal)

6×1 SparseMatrixCSC{Int64,Int64} with 6 stored entries:
  [1, 1]  =  40
  [2, 1]  =  36
  [3, 1]  =  38
  [4, 1]  =  42
  [5, 1]  =  40
  [6, 1]  =  36

In [16]:
n = size(A,1)
Matrix([ones(n,1) XA XB XC])

6×6 Array{Float64,2}:
 1.0  1.0  0.0  1.0  0.0  40.0
 1.0  0.0  1.0  1.0  0.0  36.0
 1.0  0.0  1.0  0.0  1.0  38.0
 1.0  1.0  0.0  0.0  1.0  42.0
 1.0  0.0  1.0  1.0  0.0  40.0
 1.0  0.0  1.0  1.0  0.0  36.0

In [17]:
["intercept"; namesA; namesB; namesC]

6-element Array{String,1}:
 "intercept"
 "m"        
 "f"        
 "Angus"    
 "Hereford" 
 "age"      

### Two-way model with interaction

Next we construct the $\mathbf{X}$ matrix for the two-way model with interaction between breed and sex:

$$
y_{ijk} = \mu + sex_i + breed_j + sex_i*breed_j+ e_{ijk}
$$

We already have the design matrices for the main effects.

### Design Matrix for Interaction Term

#### Make vector of levels for interaction:

In [18]:
firstName = "Rohan"
lastName = " Fernando"
firstName * lastName

"Rohan Fernando"

In [19]:
AB = A .*" x ".*B

6-element Array{String,1}:
 "m x Angus"   
 "f x Angus"   
 "f x Hereford"
 "m x Hereford"
 "f x Angus"   
 "f x Angus"   

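Broadcast concatenation with `.*` needs strings on both sides. If one factor were coded with integers (the `herd` vector below is hypothetical and does not exist in `df`), converting it with `string.` first keeps the same construction working; the final version of `getX` in this section applies the same conversion.

```julia
sexCodes = ["m", "f", "f"]
herd     = [1, 2, 1]                           # hypothetical integer-coded factor
labels   = sexCodes .* " x " .* string.(herd)  # elementwise concatenation
```
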
Use `AB` to construct `XAB` 

In [20]:
dictAB,namesAB   = mkDict(AB)
ii   = 1:size(A,1)
jj   = [dictAB[i] for i in AB]  #list comprehension 
XAB = sparse(ii,jj,1.0)

6×4 SparseMatrixCSC{Float64,Int64} with 6 stored entries:
  [1, 1]  =  1.0
  [2, 2]  =  1.0
  [5, 2]  =  1.0
  [6, 2]  =  1.0
  [3, 3]  =  1.0
  [4, 4]  =  1.0

In [21]:
namesAB

4-element Array{String,1}:
 "m x Angus"   
 "f x Angus"   
 "f x Hereford"
 "m x Hereford"

#### Design Matrix for Model

In [22]:
n = size(A,1)
Matrix([ones(n,1) XA XB XAB])

6×9 Array{Float64,2}:
 1.0  1.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0
 1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0
 1.0  0.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0
 1.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0
 1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0
 1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0

In [23]:
["intercept"; namesA; namesB; namesAB]

9-element Array{String,1}:
 "intercept"   
 "m"           
 "f"           
 "Angus"       
 "Hereford"    
 "m x Angus"   
 "f x Angus"   
 "f x Hereford"
 "m x Hereford"

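Each interaction column is the elementwise product of one sex column and one breed column. The sketch below checks this for "f x Angus"; it rebuilds the matrices from plain vectors so that it runs on its own, and `indicator` is a hypothetical helper that is not part of the notebook.

```julia
using SparseArrays

# Hypothetical helper: indicator matrix and level names for one vector of levels
function indicator(v)
    lv  = unique(v)
    col = Dict(s => j for (j, s) in enumerate(lv))
    return sparse(1:length(v), [col[s] for s in v], 1.0), lv
end

A = ["m", "f", "f", "m", "f", "f"]
B = ["Angus", "Angus", "Hereford", "Hereford", "Angus", "Angus"]
XA,  namesA  = indicator(A)
XB,  namesB  = indicator(B)
XAB, namesAB = indicator(A .* " x " .* B)

jf      = findfirst(==("f"), namesA)
jAngus  = findfirst(==("Angus"), namesB)
jfAngus = findfirst(==("f x Angus"), namesAB)

# Elementwise product of the two main-effect columns
prodCol = Vector(XA[:, jf] .* XB[:, jAngus])
```

`prodCol` matches column `jfAngus` of `XAB`: it is 1 only for animals 2, 5 and 6, the Angus females.
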
### Model with sex-specific slope for age

$$
y_{ij} = \mu + sex_i + \beta_{i}(age_{ij})+ e_{ij}
$$

In [24]:
BVal = df[:,:age]
BStr = fill("age",size(BVal,1))

6-element Array{String,1}:
 "age"
 "age"
 "age"
 "age"
 "age"
 "age"

In [25]:
AB = A.*" x ".*BStr 

6-element Array{String,1}:
 "m x age"
 "f x age"
 "f x age"
 "m x age"
 "f x age"
 "f x age"

In [26]:
dAB,namesAB   = mkDict(AB)
ii    = 1:size(AB,1)
jj   = [dAB[i] for i in AB]  #list comprehension 
XAB   = sparse(ii,jj,BVal)
Matrix(XAB)

6×2 Array{Int64,2}:
 40   0
  0  36
  0  38
 42   0
  0  40
  0  36

#### Design Matrix for Model

In [27]:
Matrix([ones(n,1) XA XAB])

6×5 Array{Float64,2}:
 1.0  1.0  0.0  40.0   0.0
 1.0  0.0  1.0   0.0  36.0
 1.0  0.0  1.0   0.0  38.0
 1.0  1.0  0.0  42.0   0.0
 1.0  0.0  1.0   0.0  40.0
 1.0  0.0  1.0   0.0  36.0

In [28]:
["intercept"; namesA; namesAB]

5-element Array{String,1}:
 "intercept"
 "m"        
 "f"        
 "m x age"  
 "f x age"  

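The slope columns can also be read as the sex indicator columns with each row scaled by that animal's age. A minimal sketch of this equivalence, using plain vectors and the illustrative names `col` and `scaled`:

```julia
using SparseArrays

A   = ["m", "f", "f", "m", "f", "f"]
age = [40, 36, 38, 42, 40, 36]
col = Dict("m" => 1, "f" => 2)
jj  = [col[s] for s in A]

XA     = sparse(1:6, jj, 1.0)      # indicator columns for sex
XAB    = sparse(1:6, jj, age)      # age placed in the column of each animal's sex
scaled = Matrix(XA) .* age         # each row of XA multiplied by that animal's age
```

Both constructions give the same 6×2 matrix, so passing the covariate values to `sparse` is a shortcut for scaling the indicator matrix.
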
### Function to Construct Design Matrix

#### Function for main effects

We will do this by putting the code we have used earlier into a function.
First, let's make our code work for quantitative or qualitative factors.

In [29]:
# Test code for qualitative factor
factor = "sex"
cov    = false
data = df[:,Symbol(factor)]

n = size(data,1)
if cov==false
    str = data
    val = 1.0
else 
    str = fill(factor,n)
    val = data
end

dict,colNames   = mkDict(str)
ii    = 1:n                    # row numbers 
jj   = [dict[i] for i in str]  # column numbers
X    = sparse(ii,jj,val)
Matrix(X)    

6×2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 0.0  1.0
 0.0  1.0

In [30]:
# Test code for quantitative factor
factor = "age"
cov    = true

data = df[:,Symbol(factor)]
n = size(data,1)
if cov==false
    str = data
    val = 1.0
else 
    str = fill(factor,n)
    val = data
end

dict,colNames   = mkDict(str)
ii    = 1:n                    # row numbers 
jj   = [dict[i] for i in str]  # column numbers
X    = sparse(ii,jj,val)
Matrix(X)

6×1 Array{Int64,2}:
 40
 36
 38
 42
 40
 36

#### Put the code in a function:

In [31]:
function getX(factor,df;cov=false)
    data = df[:,Symbol(factor)]
    n = size(data,1)
    if cov==false
        str = data
        val = 1.0
    else 
        str = fill(factor,n)
        val = data
    end

    dict,names   = mkDict(str)
    ii    = 1:n                    # row numbers 
    jj   = [dict[i] for i in str]  # column numbers
    X    = sparse(ii,jj,val)    
end        

getX (generic function with 1 method)

In [32]:
X = getX("sex",df)
Matrix(X)

6×2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 0.0  1.0
 0.0  1.0

In [33]:
X = getX("age",df,cov=true)
Matrix(X)

6×1 Array{Int64,2}:
 40
 36
 38
 42
 40
 36

#### Function for main effects or interactions

In [34]:
function getX(factors,covs,df)
    n = size(df,1)
    if covs[1] == false
        str = df[:,Symbol(factors[1])]
        val = 1.0
    else
        str = fill(factors[1],n) 
        val = df[:,Symbol(factors[1])]    
    end       

    for i in 2:length(factors)
        if covs[i] == false
            str = str .*" x ".*df[:,Symbol(factors[i])]
            val = val .* 1.0 
        else
            str = str .*" x ".*fill(factors[i],n) 
            val = val .* df[:,Symbol(factors[i])]    
        end 
    end 
    dict,colNames   = mkDict(str)
    ii    = 1:n                    # row numbers 
    jj   = [dict[i] for i in str]  # column numbers
    X    = sparse(ii,jj,val)
    return X,colNames        
end            

getX (generic function with 2 methods)

In [35]:
factors = ["sex", "breed"]
covs =[false, false]
X,colNames = getX(factors,covs,df)
Matrix(X)

6×4 Array{Float64,2}:
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0

In [36]:
colNames

4-element Array{String,1}:
 "m x Angus"   
 "f x Angus"   
 "f x Hereford"
 "m x Hereford"

In [37]:
factors = ["sex"]
covs =[false]
X, colNames = getX(factors,covs,df)
Matrix(X)

6×2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 0.0  1.0
 0.0  1.0

In [38]:
colNames

2-element Array{String,1}:
 "m"
 "f"

In [39]:
factors = ["age", "age"]
covs =[true, true]
X,colNames = getX(factors,covs,df)
Matrix(X)

6×1 Array{Int64,2}:
 1600
 1296
 1444
 1764
 1600
 1296

In [40]:
colNames

1-element Array{String,1}:
 "age x age"

### Get factors and covariables from model term

Consider the model term "sex*breed" and the vector of covariables in the model, which here contains only "age":

In [41]:
modelTerm   = "sex * breed"
covariables = ["age"];

In [42]:
?split

search: split splitext splitdir splitpath splitdrive rsplit splice! displaysize



```
split(str::AbstractString, dlm; limit::Integer=0, keepempty::Bool=true)
split(str::AbstractString; limit::Integer=0, keepempty::Bool=false)
```

Split `str` into an array of substrings on occurrences of the delimiter(s) `dlm`.  `dlm` can be any of the formats allowed by `findnext`'s first argument (i.e. as a string, regular expression or a function), or as a single character or collection of characters.

If `dlm` is omitted, it defaults to `isspace`.

The optional keyword arguments are:

  * `limit`: the maximum size of the result. `limit=0` implies no maximum (default)
  * `keepempty`: whether empty fields should be kept in the result. Default is `false` without a `dlm` argument, `true` with a `dlm` argument.

See also `rsplit`.

# Examples

```jldoctest
julia> a = "Ma.rch"
"Ma.rch"

julia> split(a,".")
2-element Array{SubString{String},1}:
 "Ma"
 "rch"
```


In [43]:
split(modelTerm,"*")

2-element Array{SubString{String},1}:
 "sex "  
 " breed"

In [44]:
factors = strip.(split(modelTerm,"*"))

2-element Array{SubString{String},1}:
 "sex"  
 "breed"

In [45]:
covs = [i in covariables for i in factors]

2-element Array{Bool,1}:
 false
 false
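
The same two steps work for a term that mixes class factors and a covariable. The three-factor term below is chosen only for illustration:

```julia
modelTerm   = "sex * breed * age"      # illustrative three-factor term
covariables = ["age"]

factors = strip.(split(modelTerm, "*"))          # "sex", "breed", "age"
covs    = [f in covariables for f in factors]    # true where the factor is a covariable
```

With these flags, only `age` is treated as a covariable; `sex` and `breed` are treated as class factors.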

In [8]:
function getX(modelTerm,covariables,df)
    n = size(df,1)
    if modelTerm == "intercept"
        X = ones(n,1)
        colNames = ["intercept"]
        return X,colNames
    end
    factors = strip.(split(modelTerm,"*"))
    covs = [i in covariables for i in factors]
    
    if covs[1] == false
        str = string.(df[:,Symbol(factors[1])])
        val = 1.0
    else
        str = fill(factors[1],n) 
        val = df[:,Symbol(factors[1])]    
    end       

    for i in 2:length(factors)
        if covs[i] == false
            str = str .*" x ".*string.(df[:,Symbol(factors[i])])
            val = val .* 1.0 
        else
            str = str .*" x ".*fill(factors[i],n) 
            val = val .* df[:,Symbol(factors[i])]    
        end 
    end 
    dict,colNames   = mkDict(str)
    ii = 1:n                     # row numbers 
    jj = [dict[i] for i in str]  # column numbers
    X  = sparse(ii,jj,val)
    return X, strip(modelTerm)*": ".*colNames   
end

getX (generic function with 1 method)

In [10]:
modelTerm   = "sex * breed"
covariables = ["age"]
X,colNames = getX(modelTerm,covariables,df)
Matrix(X)

6×4 Array{Float64,2}:
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0

In [11]:
colNames

4-element Array{String,1}:
 "sex * breed: m x Angus"   
 "sex * breed: f x Angus"   
 "sex * breed: f x Hereford"
 "sex * breed: m x Hereford"

In [12]:
modelTerm = "breed*age"
X,colNames = getX(modelTerm,covariables,df)
Matrix(X)

6×2 Array{Float64,2}:
 40.0   0.0
 36.0   0.0
  0.0  38.0
  0.0  42.0
 40.0   0.0
 36.0   0.0

In [13]:
colNames

2-element Array{String,1}:
 "breed*age: Angus x age"   
 "breed*age: Hereford x age"

In [14]:
modelTerm = "sex"
X,colNames = getX(modelTerm,covariables,df)
Matrix(X)

6×2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 0.0  1.0
 0.0  1.0

In [15]:
colNames

2-element Array{String,1}:
 "sex: m"
 "sex: f"

In [16]:
modelTerm = "intercept"
X,colNames = getX(modelTerm,covariables,df)
Matrix(X)

6×1 Array{Float64,2}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

In [17]:
colNames

1-element Array{String,1}:
 "intercept"

### Get model-terms from string representation of model

Consider the model:

In [29]:
modelEq = "y = intercept + sex + breed + sex*breed + age"

"y = intercept + sex + breed + sex*breed + age"

In [19]:
modelParts = strip.(split(modelEq,"="))

2-element Array{SubString{String},1}:
 "y"                                        
 "intercept + sex + breed + sex*breed + age"

In [20]:
depVar = modelParts[1]
modelTerms = strip.(split(modelParts[2],"+"))

5-element Array{SubString{String},1}:
 "intercept"
 "sex"      
 "breed"    
 "sex*breed"
 "age"      

In [21]:
X,colNames = getX(modelTerms[1],covariables,df)
Matrix(X)

6×1 Array{Float64,2}:
 1.0
 1.0
 1.0
 1.0
 1.0
 1.0

In [22]:
i = 2
Xi,namesi = getX(modelTerms[i],covariables,df)
X = [X Xi]
colNames = [colNames; namesi]
Matrix(X)

6×3 Array{Float64,2}:
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  1.0
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  1.0

In [23]:
colNames

3-element Array{String,1}:
 "intercept"
 "sex: m"   
 "sex: f"   

In [24]:
i = 3
Xi,namesi = getX(modelTerms[i],covariables,df)
X = [X Xi]
colNames = [colNames; namesi]
Matrix(X)

6×5 Array{Float64,2}:
 1.0  1.0  0.0  1.0  0.0
 1.0  0.0  1.0  1.0  0.0
 1.0  0.0  1.0  0.0  1.0
 1.0  1.0  0.0  0.0  1.0
 1.0  0.0  1.0  1.0  0.0
 1.0  0.0  1.0  1.0  0.0

In [25]:
colNames

5-element Array{String,1}:
 "intercept"      
 "sex: m"         
 "sex: f"         
 "breed: Angus"   
 "breed: Hereford"

In [27]:
function getLhsRhs(modelEq,covariables,df)
    modelParts = strip.(split(modelEq,"="))
    depVar = modelParts[1]
    y = df[:,Symbol(depVar)]
    modelTerms = strip.(split(modelParts[2],"+"))
    X,colNames = getX(modelTerms[1],covariables,df)
    for i = 2:size(modelTerms,1)
        Xi,namesi = getX(modelTerms[i],covariables,df)
        X = [X Xi]
        colNames = [colNames; namesi]
    end
    return X'X,X'y,colNames
end

getLhsRhs (generic function with 1 method)

In [30]:
lhs,rhs,colNames = getLhsRhs(modelEq,covariables,df)
[colNames Matrix(lhs) rhs]

10×12 Array{Any,2}:
 "intercept"                  6.0   2.0  …    3.0   1.0   1.0   232.0   0.043
 "sex: m"                     2.0   2.0       0.0   0.0   1.0    82.0  -2.143
 "sex: f"                     4.0   0.0       3.0   1.0   0.0   150.0   2.186
 "breed: Angus"               4.0   1.0       3.0   0.0   0.0   152.0  -0.136
 "breed: Hereford"            2.0   1.0       0.0   1.0   1.0    80.0   0.179
 "sex*breed: m x Angus"       1.0   1.0  …    0.0   0.0   0.0    40.0  -1.15 
 "sex*breed: f x Angus"       3.0   0.0       3.0   0.0   0.0   112.0   1.014
 "sex*breed: f x Hereford"    1.0   0.0       0.0   1.0   0.0    38.0   1.172
 "sex*breed: m x Hereford"    1.0   1.0       0.0   0.0   1.0    42.0  -0.993
 "age: age"                 232.0  82.0     112.0  38.0  42.0  9000.0  -7.814

In [31]:
QRLhs = qr(lhs) 
sol = QRLhs\rhs

10-element Array{Float64,1}:
 10.07824999998193   
 -1.2275000000018992 
  0.0                
 -0.990249999999663  
  0.0                
  0.36450000000061267
  0.0                
  0.0                
  0.0                
 -0.23437499999952532
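
Because the intercept column equals the sum of the sex columns (and likewise for breed), `lhs` is singular, so the vector above is only one of infinitely many solutions of these equations. Fitted values do not depend on which solution is used. The sketch below illustrates this with the smaller model `y = intercept + sex` and the `y` values printed in the table at the top of this section; it uses dense matrices and `pinv`, which is a choice made here for illustration and not the solver used above.

```julia
using LinearAlgebra

y  = [-1.15, -0.549, 1.172, -0.993, -0.287, 1.85]             # y column shown at the top
XA = [1.0 0.0; 0.0 1.0; 0.0 1.0; 1.0 0.0; 0.0 1.0; 0.0 1.0]   # sex columns: m, f
X  = [ones(6) XA]                                             # intercept, m, f

lhs = X'X
rhs = X'y
r   = rank(lhs; rtol=1e-8)               # 2, although lhs is 3×3

sol1 = pinv(lhs; rtol=1e-8) * rhs        # minimum-norm solution
sol2 = [X[:, 1:2] \ y; 0.0]              # solution with the f effect set to 0

fitted1 = X * sol1
fitted2 = X * sol2
```

`sol1` and `sol2` differ, but both satisfy `lhs * sol ≈ rhs` and they give the same fitted values.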