## Design Matrices from Data and Model

Here we show how a design matrix can be constructed from the data in a DataFrame and a model given as a string.

### Data

In [1]:
using DataFrames, SparseArrays
animal= ["animal1","animal2","animal3","animal4","animal5","animal6"]
sex   = ["m","f","f","m","f","f"]
breed = ["Angus","Angus","Hereford","Hereford","Angus","Angus"]
age   = [40,36,38,42,40,36]
df    = DataFrame(animal=animal,sex=sex,breed=breed,age=age,y=round.(randn(6),digits=3))

| Row | animal | sex | breed | age | y |
|-----|--------|-----|-------|-----|---|
|  | String | String | String | Int64 | Float64 |
| 1 | animal1 | m | Angus | 40 | 0.488 |
| 2 | animal2 | f | Angus | 36 | -1.519 |
| 3 | animal3 | f | Hereford | 38 | -0.795 |
| 4 | animal4 | m | Hereford | 42 | 0.775 |
| 5 | animal5 | f | Angus | 40 | 0.422 |
| 6 | animal6 | f | Angus | 36 | -0.238 |


### Model

$$
y_{ij} = \mu + sex_i + e_{ij}.
$$

We have seen previously how to construct the design matrix for a one-way model when the levels are sequential integers. In this DataFrame, the levels of sex are the strings `"m"` and `"f"`. Below, we assign a sequential integer to each of these strings.

#### Get the levels of sex from the DataFrame into a vector `A`

In [2]:
A=df[:,:sex]

6-element Array{String,1}:
 "m"
 "f"
 "f"
 "m"
 "f"
 "f"

The `unique` function returns the unique levels of a vector:

In [3]:
res = unique(A)

2-element Array{String,1}:
 "m"
 "f"

Now we can make a dictionary in which each unique level is a key and the associated value is its sequential integer:

In [4]:
dictA = Dict()                 # declare empty dictionary
for (i,s) in enumerate(res)    # fill the dictionary with the values in res
    dictA[s] = i
end
dictA

Dict{Any,Any} with 2 entries:
  "f" => 2
  "m" => 1

We can use this dictionary to make the design matrix. Each observation contributes one entry: its row number and the column number of its level.

In [5]:
ii = 1:size(A,1)            # row numbers
jj = [dictA[i] for i in A]  # column numbers

6-element Array{Int64,1}:
 1
 2
 2
 1
 2
 2
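
The dictionary and the column numbers can also be built with a generator expression. The sketch below is an alternative, not the approach used in the rest of this notebook; it yields a `Dict{String,Int64}` instead of `Dict{Any,Any}`, and the names `levelsA` and `dictTyped` are introduced only for illustration.

```julia
A = ["m", "f", "f", "m", "f", "f"]        # same sex vector as above

levelsA   = unique(A)                      # levels in order of first appearance
dictTyped = Dict(s => i for (i, s) in enumerate(levelsA))

# Look up the column number for every record
jj = [dictTyped[s] for s in A]
```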

In [5]:
[ii jj]

6×2 Array{Int64,2}:
 1  1
 2  2
 3  2
 4  1
 5  2
 6  2

In [6]:
XA   = sparse(ii,jj,1.0)

6×2 SparseMatrixCSC{Float64,Int64} with 6 stored entries:
  [1, 1]  =  1.0
  [4, 1]  =  1.0
  [2, 2]  =  1.0
  [3, 2]  =  1.0
  [5, 2]  =  1.0
  [6, 2]  =  1.0
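
To make explicit what `sparse(ii,jj,1.0)` does here, the following sketch fills a dense matrix entry by entry and compares it with the sparse result. The name `Xdense` is introduced only for this illustration, and the comparison relies on there being no repeated (row, column) pairs, which holds because each row number appears once.

```julia
using SparseArrays

ii = 1:6
jj = [1, 2, 2, 1, 2, 2]

# Dense equivalent of sparse(ii, jj, 1.0): set entry (ii[k], jj[k]) to 1.0 for each k
Xdense = zeros(length(ii), maximum(jj))
for k in eachindex(jj)
    Xdense[ii[k], jj[k]] = 1.0
end

Xdense == sparse(ii, jj, 1.0)     # true
```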

In [7]:
n = size(A,1)
Matrix([ones(n,1) XA])

6×3 Array{Float64,2}:
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  1.0
 1.0  1.0  0.0
 1.0  0.0  1.0
 1.0  0.0  1.0
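
One property of this matrix can be checked numerically: the two sex columns add up to the intercept column, so the matrix has three columns but only rank two. The sketch below rebuilds the matrix from the indices shown above and uses `rank` from the `LinearAlgebra` standard library, which this notebook does not otherwise load.

```julia
using SparseArrays, LinearAlgebra

ii = 1:6
jj = [1, 2, 2, 1, 2, 2]                 # column numbers for m, f, f, m, f, f
XA = sparse(ii, jj, 1.0)
X  = Matrix([ones(6, 1) XA])

X[:, 2] + X[:, 3] == X[:, 1]            # sex columns add up to the intercept column
rank(X)                                  # 2, although X has 3 columns
```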

In [8]:
# This function returns a dictionary with the unique values in the vector "a" as the keys
# and their sequential numbers as the associated values.
# It also returns a vector with the keys in sequential order.
function mkDict(a)
  aUnique = unique(a)
  d = Dict()
  names = Array{String}(undef,size(aUnique,1))
  for (i,s) in enumerate(aUnique)
    names[i] = s
    d[s] = i
  end
  return d,names
end

mkDict (generic function with 1 method)
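
Because `names` is allocated as an array of `String`, `mkDict` as written accepts only string levels. A minimal sketch of a variant that takes its key type from the input follows; the name `mkDictTyped` is hypothetical and is not used elsewhere in this notebook.

```julia
# Hypothetical variant of mkDict that works for levels of any type
function mkDictTyped(a::AbstractVector{T}) where {T}
    lvls = unique(a)                        # levels in order of first appearance
    d = Dict{T,Int}()
    for (i, s) in enumerate(lvls)
        d[s] = i
    end
    return d, lvls
end

d, lvls = mkDictTyped([2019, 2021, 2019, 2020])   # integer levels, e.g. years
```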

### Model
$$
y_{ij} = \mu + sex_i + breed_j + \beta age_{ij} + e_{ij}
$$

We will use the `mkDict` function to construct the design matrix for this model from the data in DataFrame `df`.

In [9]:
dictA,namesA   = mkDict(A)

(Dict{Any,Any}("f"=>2,"m"=>1), ["m", "f"])

In [10]:
namesA

2-element Array{String,1}:
 "m"
 "f"

In [11]:
ii = 1:size(A,1)
jj = [dictA[i] for i in A]  #list comprehension 
[ii jj]
XA = sparse(ii,jj,1.0)

6×2 SparseMatrixCSC{Float64,Int64} with 6 stored entries:
  [1, 1]  =  1.0
  [4, 1]  =  1.0
  [2, 2]  =  1.0
  [3, 2]  =  1.0
  [5, 2]  =  1.0
  [6, 2]  =  1.0

In [12]:
B = df[:,:breed]
dictB,namesB   = mkDict(B)
jj   = [dictB[i] for i in B]  #list comprehension 
ii   = 1:size(B,1)
XB   = sparse(ii,jj,1.0)

6×2 SparseMatrixCSC{Float64,Int64} with 6 stored entries:
  [1, 1]  =  1.0
  [2, 1]  =  1.0
  [5, 1]  =  1.0
  [6, 1]  =  1.0
  [3, 2]  =  1.0
  [4, 2]  =  1.0

In [13]:
dictB

Dict{Any,Any} with 2 entries:
  "Angus"    => 1
  "Hereford" => 2

In [11]:
CVal = df[:,:age]
CStr = fill("age",size(CVal,1))  # only one column in design matrix for age

6-element Array{String,1}:
 "age"
 "age"
 "age"
 "age"
 "age"
 "age"

In [12]:
dictC,namesC   = mkDict(CStr)
jj   = [dictC[i] for i in CStr]  #list comprehension 
ii   = 1:size(CStr,1)
XC   = sparse(ii,jj,CVal)

6×1 SparseMatrixCSC{Int64,Int64} with 6 stored entries:
  [1, 1]  =  40
  [2, 1]  =  36
  [3, 1]  =  38
  [4, 1]  =  42
  [5, 1]  =  40
  [6, 1]  =  36

In [13]:
n = size(A,1)
Matrix([ones(n,1) XA XB XC])

6×6 Array{Float64,2}:
 1.0  1.0  0.0  1.0  0.0  40.0
 1.0  0.0  1.0  1.0  0.0  36.0
 1.0  0.0  1.0  0.0  1.0  38.0
 1.0  1.0  0.0  0.0  1.0  42.0
 1.0  0.0  1.0  1.0  0.0  40.0
 1.0  0.0  1.0  1.0  0.0  36.0

In [14]:
["intercept"; namesA; namesB; namesC]

6-element Array{Any,1}:
 "intercept"
 "m"        
 "f"        
 "Angus"    
 "Hereford" 
 "age"      

### Two-way model with interaction

We now construct the $\mathbf{X}$ matrix for the two-way model with interaction between sex and breed:

$$
y_{ijk} = \mu + sex_i + breed_j + (sex \times breed)_{ij} + e_{ijk}
$$

We already have the design matrices for the main effects.

### Design Matrix for Interaction Term

#### Make vector of levels for interaction:

Julia concatenates strings with the `*` operator; the dotted form `.*` applies it elementwise, which is how the interaction labels are built from `A` and `B`.

In [15]:
firstName = "Rohan"
lastName = " Fernando"
firstName * lastName

"Rohan Fernando"

In [16]:
AB = A .*" x ".*B

6-element Array{String,1}:
 "m x Angus"   
 "f x Angus"   
 "f x Hereford"
 "m x Hereford"
 "f x Angus"   
 "f x Angus"   

Use `AB` to construct `XAB` 

In [22]:
dictAB,namesAB   = mkDict(AB)
ii   = 1:size(A,1)
jj   = [dictAB[i] for i in AB]  #list comprehension 
XAB = sparse(ii,jj,1.0)

6×4 SparseMatrixCSC{Float64,Int64} with 6 stored entries:
  [1, 1]  =  1.0
  [2, 2]  =  1.0
  [5, 2]  =  1.0
  [6, 2]  =  1.0
  [3, 3]  =  1.0
  [4, 4]  =  1.0
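
Each column of `XAB` is the elementwise product of one sex indicator and one breed indicator. The sketch below checks this for the level `m x Angus` with plain comparisons, independently of the sparse construction; the names `isM`, `isAngus`, and `isMAngus` are introduced only for this check.

```julia
A  = ["m", "f", "f", "m", "f", "f"]
B  = ["Angus", "Angus", "Hereford", "Hereford", "Angus", "Angus"]
AB = A .* " x " .* B

isM      = A .== "m"                 # indicator column for sex m
isAngus  = B .== "Angus"             # indicator column for breed Angus
isMAngus = AB .== "m x Angus"        # indicator column for the interaction level

isM .* isAngus == isMAngus           # true
```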

In [23]:
namesAB

4-element Array{String,1}:
 "m x Angus"   
 "f x Angus"   
 "f x Hereford"
 "m x Hereford"

#### Design Matrix for Model

In [24]:
n = size(A,1)
Matrix([ones(n,1) XA XB XAB])

6×9 Array{Float64,2}:
 1.0  1.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0
 1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0
 1.0  0.0  1.0  0.0  1.0  0.0  0.0  1.0  0.0
 1.0  1.0  0.0  0.0  1.0  0.0  0.0  0.0  1.0
 1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0
 1.0  0.0  1.0  1.0  0.0  0.0  1.0  0.0  0.0

In [19]:
["intercept"; namesA; namesB; namesAB]

9-element Array{Any,1}:
 "intercept"   
 "m"           
 "f"           
 "Angus"       
 "Hereford"    
 "m x Angus"   
 "f x Angus"   
 "f x Hereford"
 "m x Hereford"

### Model with sex-specific slope for age

$$
y_{ij} = \mu + sex_i + \beta_{i}(age_{ij})+ e_{ij}
$$

In [25]:
BVal = df[:,:age]
BStr = fill("age",size(BVal,1))

6-element Array{String,1}:
 "age"
 "age"
 "age"
 "age"
 "age"
 "age"

In [26]:
AB = A.*" x ".*BStr 

6-element Array{String,1}:
 "m x age"
 "f x age"
 "f x age"
 "m x age"
 "f x age"
 "f x age"

In [27]:
dAB,namesAB   = mkDict(AB)
ii    = 1:size(AB,1)
jj   = [dAB[i] for i in AB]  #list comprehension 
XAB   = sparse(ii,jj,BVal)
Matrix(XAB)

6×2 Array{Int64,2}:
 40   0
  0  36
  0  38
 42   0
  0  40
  0  36

#### Design Matrix for Model

In [28]:
Matrix([ones(n,1) XA XAB])

6×5 Array{Float64,2}:
 1.0  1.0  0.0  40.0   0.0
 1.0  0.0  1.0   0.0  36.0
 1.0  0.0  1.0   0.0  38.0
 1.0  1.0  0.0  42.0   0.0
 1.0  0.0  1.0   0.0  40.0
 1.0  0.0  1.0   0.0  36.0
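
The slope columns are the sex indicator columns with each row scaled by that animal's age. A short check, reusing the column numbers and ages shown above:

```julia
using SparseArrays

jj  = [1, 2, 2, 1, 2, 2]             # column numbers for m, f, f, m, f, f
age = [40, 36, 38, 42, 40, 36]

XA  = sparse(1:6, jj, 1.0)           # indicator columns for sex
XAB = sparse(1:6, jj, age)           # slope columns, built as in the cell above

Matrix(XA) .* age == Matrix(XAB)     # true: each indicator row is scaled by age
```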

In [29]:
["intercept"; namesA; namesAB]

5-element Array{String,1}:
 "intercept"
 "m"        
 "f"        
 "m x age"  
 "f x age"  

### Function to Construct Design Matrix

#### Function for main effects

We will do this by putting the code we have used earlier into a function.
First, let's make our code work for both qualitative factors (class variables) and quantitative factors (covariates).

In [35]:
# Test code for qualitative factor
factor = "sex"
cov    = false
data = df[:,Symbol(factor)]

n = size(data,1)
if cov==false
    str = data
    val = 1.0
else 
    str = fill(factor,n)
    val = data
end

dict,names   = mkDict(str)
ii    = 1:n                    # row numbers 
jj   = [dict[i] for i in str]  # column numbers
X    = sparse(ii,jj,val)
Matrix(X)    

6×2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 0.0  1.0
 0.0  1.0

In [36]:
# Test code for quantitative factor
factor = "age"
cov    = true

data = df[:,Symbol(factor)]
n = size(data,1)
if cov==false
    str = data
    val = 1.0
else 
    str = fill(factor,n)
    val = data
end

dict,names   = mkDict(str)
ii    = 1:n                    # row numbers 
jj   = [dict[i] for i in str]  # column numbers
X    = sparse(ii,jj,val)
Matrix(X)

6×1 Array{Int64,2}:
 40
 36
 38
 42
 40
 36

#### Put the code in a function:

In [37]:
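# Returns the sparse design matrix for a single factor.
# cov=false: class variable, one indicator column per level.
# cov=true:  covariate, a single column holding the observed values.
function getX(factor,df;cov=false)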
    data = df[:,Symbol(factor)]
    n = size(data,1)
    if cov==false
        str = data
        val = 1.0
    else 
        str = fill(factor,n)
        val = data
    end

    dict,names   = mkDict(str)
    ii    = 1:n                    # row numbers 
    jj   = [dict[i] for i in str]  # column numbers
    X    = sparse(ii,jj,val)    
end        

getX (generic function with 1 method)

In [39]:
X = getX("sex",df)
Matrix(X)

6×2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 0.0  1.0
 0.0  1.0

In [40]:
X = getX("age",df,cov=true)
Matrix(X)

6×1 Array{Int64,2}:
 40
 36
 38
 42
 40
 36

In [43]:
factors = ["sex", "breed"]
covs =[false, false]
n = size(df,1)

6

In [60]:
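# Returns the dense design matrix for the interaction of all entries in `factors`.
# covs[i] says whether factors[i] is a covariate (true) or a class variable (false).
function getX(factors,covs,df)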
    n = size(df,1)
    if covs[1] == false
        str = df[:,Symbol(factors[1])]
        val = 1.0
    else
        str = fill(factors[1],n) 
        val = df[:,Symbol(factors[1])]    
    end       

    for i in 2:length(factors)
        if covs[i] == false
            str = str .*" x ".*df[:,Symbol(factors[i])]
            val = val .* 1.0 
        else
            str = str .*" x ".*fill(factors[i],n) 
            val = val .* df[:,Symbol(factors[i])]    
        end 
    end 
    dict,names   = mkDict(str)
    ii    = 1:n                    # row numbers 
    jj   = [dict[i] for i in str]  # column numbers
    X    = sparse(ii,jj,val)    
    Matrix(X) 
end            

getX (generic function with 2 methods)

In [61]:
getX(factors,covs,df)

6×4 Array{Float64,2}:
 1.0  0.0  0.0  0.0
 0.0  1.0  0.0  0.0
 0.0  0.0  1.0  0.0
 0.0  0.0  0.0  1.0
 0.0  1.0  0.0  0.0
 0.0  1.0  0.0  0.0

In [62]:
factors = ["sex"]
covs =[false]
getX(factors,covs,df)

6×2 Array{Float64,2}:
 1.0  0.0
 0.0  1.0
 0.0  1.0
 1.0  0.0
 0.0  1.0
 0.0  1.0

Listing a covariate twice multiplies it by itself, so the single column below holds age squared:

In [66]:
factors = ["age", "age"]
covs =[true, true]
getX(factors,covs,df)

6×1 Array{Int64,2}:
 1600
 1296
 1444
 1764
 1600
 1296
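
The functions above handle one term at a time. One possible way to go from a model string to the full matrix is sketched below, under the assumption that terms are separated by `+` and the factors within a term by `*`. The helpers `termX` and `modelX`, the `covariates` keyword, and the use of a NamedTuple of columns in place of the DataFrame are all hypothetical choices made to keep the example self-contained; they are not part of this notebook's code.

```julia
using SparseArrays

# Hypothetical helper: design-matrix block and column names for one term such as "sex*age".
# `data` is a NamedTuple of columns standing in for the DataFrame; `covariates` lists
# the columns to be treated as quantitative.
function termX(term, data, covariates)
    n   = length(first(data))
    str = fill("", n)
    val = ones(n)
    for (k, f) in enumerate(strip.(split(term, "*")))
        col = data[Symbol(f)]
        lab = f in covariates ? fill(String(f), n) : string.(col)
        str = k == 1 ? lab : str .* " x " .* lab
        if f in covariates
            val = val .* col              # covariates scale the entries
        end
    end
    lvls = unique(str)
    dict = Dict(s => i for (i, s) in enumerate(lvls))
    jj   = [dict[s] for s in str]
    return sparse(1:n, jj, val), lvls
end

# Hypothetical driver: intercept column plus one block per "+"-separated term
function modelX(model, data; covariates = String[])
    n = length(first(data))
    X = sparse(ones(n, 1))
    colNames = ["intercept"]
    for term in strip.(split(model, "+"))
        Xt, nt = termX(term, data, covariates)
        X = [X Xt]
        append!(colNames, nt)
    end
    return X, colNames
end

data = (sex = ["m", "f", "f", "m", "f", "f"], age = [40, 36, 38, 42, 40, 36])
X, colNames = modelX("sex + sex*age", data; covariates = ["age"])
```

For the model string `"sex + sex*age"`, this should reproduce the columns of the sex-specific slope model built earlier: an intercept, two sex indicators, and two slope columns.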