#### Model mieszany + estymacja parametrów wariancji + dodanie markerów genetycznych

Rozważmy model: 

$\mathbf{y} = \mathbf{X\beta+Za+\epsilon}$, gdzie

$\mathbf{y}$ - wektor z wartościami fenotypowmi ($n\times 1$ w naszym przypadku n = ilość zwierząt) 

$\mathbf{\beta}$ - wektor efektów stałych ($p\times 1$)

$\mathbf{a}$ - wektor efektów losowych ($q\times 1$)

$\mathbf{\epsilon}$ - wektor błędów ($n\times 1$)

$\mathbf{X}$ - macierz wystąpien dla efektów stałych ($n\times p$)

$\mathbf{Z}$ - macierz wystąpien dla efektów losowych ($n\times q$)

Zakładamy, że macierze $\mathbf{X}$ i $\mathbf{Z}$ są niezależne. Dodatkowo zakładamy, że:

$E(\mathbf{y}) = \mathbf{X\beta}$, $E(\mathbf{a})=E(\mathbf{\epsilon})=0$

$V=Var(\mathbf{y})=Var(\mathbf{Za})+ Var(\mathbf{\epsilon})=\mathbf{Z}Var(\mathbf{a})\mathbf{Z}^{T}+\mathbf{I}Var(\mathbf{\epsilon})\mathbf{I}^{T}=\mathbf{ZGZ}^{T}+\mathbf{R}$, gdzie

$\mathbf{G}=\mathbf{A}\cdot\sigma^{2}_{a}$, $\mathbf{R}=\mathbf{I}\cdot\sigma^{2}_{\epsilon}$

$ \left[ \begin{array}{cc}
         \mathbf{X}^{T}\mathbf{X} & \mathbf{X}^{T}\mathbf{Z} \\
         \mathbf{Z}^{T}\mathbf{X} & \mathbf{Z}^{T}\mathbf{Z}+\mathbf{A}^{-1}\alpha \end{array}\right]\cdot
  \left[ \begin{array}{c}
         \widehat{\mathbf{\beta}} \\
         \widehat{\mathbf{a}} \end{array}\right]=
  \left[ \begin{array}{c}
         \mathbf{X}^{T}\mathbf{y} \\
         \mathbf{Z}^{T}\mathbf{y} \end{array}\right]$, gdzie

$\alpha=\frac{\sigma^{2}_{\epsilon}}{\sigma^{2}_{a}}$

$ \left[ \begin{array}{c}
         \widehat{\mathbf{\beta}} \\
         \widehat{\mathbf{a}} \end{array}\right]=
  \left[ \begin{array}{cc}
         \mathbf{X}^{T}\mathbf{X} & \mathbf{X}^{T}\mathbf{Z} \\
         \mathbf{Z}^{T}\mathbf{X} & \mathbf{Z}^{T}\mathbf{Z}+\mathbf{A}^{-1}\alpha \end{array}\right]^{-1}\cdot
  \left[ \begin{array}{c}
         \mathbf{X}^{T}\mathbf{y} \\
         \mathbf{Z}^{T}\mathbf{y} \end{array}\right]$

$ C = \left[ \begin{array}{cc}
         \mathbf{X}^{T}\mathbf{X} & \mathbf{X}^{T}\mathbf{Z} \\
         \mathbf{Z}^{T}\mathbf{X} & \mathbf{Z}^{T}\mathbf{Z}+\mathbf{A}^{-1}\alpha \end{array}\right] = 
         \left[ \begin{array}{cc}
         \mathbf{C}_{11} & \mathbf{C}_{12} \\ \mathbf{C}_{21} & \mathbf{C}_{22} \end{array}\right]$

$ C^{-1} = \left[ \begin{array}{cc}
         \mathbf{C}^{11} & \mathbf{C}^{12} \\ \mathbf{C}^{21} & \mathbf{C}^{22} \end{array}\right]$

Dokładność oszacowania wynosi: $r^2 = diag(1-\mathbf{C}^{22}\cdot\alpha$)

1. Liczymy macierz A

In [1]:
library(pedigreemm)

id = 1:8
sire = c(NA, NA, NA, 1, 3, 1, 4, 3)
dam = c(NA, NA, NA, NA, 2, 2, 5, 6)

cbind(id, sire, dam)

"package 'pedigreemm' was built under R version 3.6.3"Loading required package: lme4
"package 'lme4' was built under R version 3.6.3"Loading required package: Matrix


id,sire,dam
1,,
2,,
3,,
4,1.0,
5,3.0,2.0
6,1.0,2.0
7,4.0,5.0
8,3.0,6.0


In [2]:
(ped = pedigree(sire = sire, dam = dam, label = id))

  sire  dam
1 <NA> <NA>
2 <NA> <NA>
3 <NA> <NA>
4    1 <NA>
5    3    2
6    1    2
7    4    5
8    3    6

In [3]:
(A = as.matrix(getA(ped)))

1,2,3,4,5,6,7,8
1.0,0.0,0.0,0.5,0.0,0.5,0.25,0.25
0.0,1.0,0.0,0.0,0.5,0.5,0.25,0.25
0.0,0.0,1.0,0.0,0.5,0.0,0.25,0.5
0.5,0.0,0.0,1.0,0.0,0.25,0.5,0.125
0.0,0.5,0.5,0.0,1.0,0.25,0.5,0.375
0.5,0.5,0.0,0.25,0.25,1.0,0.25,0.5
0.25,0.25,0.25,0.5,0.5,0.25,1.0,0.25
0.25,0.25,0.5,0.125,0.375,0.5,0.25,1.0


2. Definiujemy kolejne źródła informacji

In [4]:
y = as.matrix(c(4.5, 2.9, 3.9, 3.5, 5.0))
t(y)

sex = c(1, 0, 0, 1, 1)
X = matrix(0, 5, 2)
X[,1] = sex
X[,2] = 1-sex
X

model.matrix(~factor(sex) - 1)

0,1,2,3,4
4.5,2.9,3.9,3.5,5


0,1
1,0
0,1
0,1
1,0
1,0


factor(sex)0,factor(sex)1
0,1
1,0
1,0
0,1
0,1


In [5]:
I = diag(5)
Z = matrix(0, 5, 8)
Z[1:5, 4:8] = I
Z

0,1,2,3,4,5,6,7
0,0,0,1,0,0,0,0
0,0,0,0,1,0,0,0
0,0,0,0,0,1,0,0
0,0,0,0,0,0,1,0
0,0,0,0,0,0,0,1


3. Rozwiązujemy układ równan mieszanych

In [6]:
library(MASS)

mme = function(y, X, Z, A, sigma_a, sigma_e) {
    alpha = sigma_e / sigma_a
    invA = ginv(A)
    C = rbind(cbind(t(X)%*%X, t(X)%*%Z),
              cbind(t(Z)%*%X, t(Z)%*%Z+invA*c(alpha)))
    rhs = rbind(t(X)%*%y, t(Z)%*%y)
    invC = ginv(C)
    estimators = invC%*%rhs
    list(C = C, est = estimators)
}

mme(y, X, Z, A, 20, 40)

0,1,2,3,4,5,6,7,8,9
3,0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0
0,2,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
0,0,3.666667,1.0,5.107026e-15,-1.333333,8.881784e-16,-2.0,-1.054712e-14,-4.385381e-15
0,0,1.0,4.0,1.0,-3.552714e-15,-2.0,-2.0,2.220446e-15,-7.979728e-16
0,0,8.881784e-16,1.0,4.0,1.332268e-15,-2.0,1.0,-2.220446e-15,-2.0
1,0,-1.333333,-2.442491e-15,-2.997602e-15,4.666667,1.0,-3.330669e-16,-2.0,2.400857e-15
0,1,-4.440892e-16,-2.0,-2.0,1.0,6.0,-1.998401e-15,-2.0,-1.137979e-15
0,1,-2.0,-2.0,1.0,3.330669e-16,-2.220446e-16,6.0,1.554312e-15,-2.0
1,0,3.330669e-16,1.332268e-15,-2.220446e-16,-2.0,-2.0,1.110223e-15,5.0,-9.78384e-16
1,0,-6.245005e-16,-6.356027e-15,-2.0,-8.396062e-15,6.938894000000001e-17,-2.0,7.188694e-15,5.0

0
4.35850233
3.404430006
0.098444576
-0.018770099
-0.041084203
-0.008663123
-0.185732099
0.176872088
-0.249458555
0.182614688


4. Dokładność wartości hodowlanych

In [7]:
C = as.matrix(mme(y, X, Z, A, 20, 40)$C)
(invC = ginv(C))

invC22 = invC[3:10, 3:10]
(r2 = diag(1 - invC22*2))
(r = sqrt(r2))

0,1,2,3,4,5,6,7,8,9
0.59556097,0.15730213,-0.164119413,-0.083624565,-0.13059083,-0.26455749,-0.14827804,-0.16632621,-0.2842464,-0.237879
0.15730213,0.80245865,-0.13286326,-0.241250738,-0.1119684,-0.08730803,-0.29891465,-0.30600266,-0.1859495,-0.1986488
-0.16411941,-0.13286326,0.471094211,0.006928037,0.03264668,0.21954371,0.04495225,0.22077427,0.1386223,0.1341923
-0.08362457,-0.24125074,0.006928037,0.492095721,-0.01030797,0.02039033,0.23734577,0.24515571,0.1198194,0.110664
-0.13059083,-0.1119684,0.032646682,-0.010307967,0.45645878,0.04812709,0.20132326,0.02261354,0.1258983,0.2177471
-0.26455749,-0.08730803,0.219543709,0.020390333,0.04812709,0.42768015,0.0470442,0.12757186,0.2428012,0.1231911
-0.14827804,-0.29891465,0.044952254,0.23734577,0.20132326,0.0470442,0.42810675,0.16972255,0.219716,0.1780739
-0.16632621,-0.30600266,0.220774267,0.245155707,0.02261354,0.12757186,0.16972255,0.44228277,0.152183,0.2192238
-0.28424641,-0.1859495,0.138622268,0.119819354,0.12589831,0.24280124,0.21971599,0.15218301,0.4418562,0.1680818
-0.23787901,-0.19864885,0.134192262,0.110664009,0.2177471,0.12319108,0.17807393,0.21922376,0.1680818,0.4223641


5. Istotność efektów stałych

Test Walda:

$W = \frac{\widehat{\beta}}{se(\widehat{\beta})}\sim\mathcal{N}(0,1)$

$H_{0}: \beta=0$ vs. $H_{1}: \beta\neq 0$

$var(\widehat{\beta}) = \left(\mathbf{X}^{T}\mathbf{V}^{-1}\mathbf{X}\right)^{-1}$

In [8]:
G = A*20
R = diag(5)*40
V = Z%*%G%*%t(Z) + R
V

(varB = ginv(t(X)%*%ginv(V)%*%X))
(seB = sqrt(diag(varB)))

(testWalda = mme(y, X, Z, A, 20, 40)$est[1:2] / seB)
(p_value = 2*pnorm(abs(testWalda), lower.tail = FALSE))

0,1,2,3,4
60.0,0.0,5,10,2.5
0.0,60.0,5,10,7.5
5.0,5.0,60,5,10.0
10.0,10.0,5,60,5.0
2.5,7.5,10,5,60.0


0,1
23.822439,6.292085
6.292085,32.098346


6. Estymacja parametrów wariancji

In [9]:
var(y)

0
0.678


In [10]:
sigma_a = 100.01  #starting value for random effect
sigma_e = 9.01 #starting value for error variance

In [11]:
EM = function(y, X, Z, A, sigma_a, sigma_e) {
  n = nrow(X)
  p = ncol(X) 
  q = nrow(A) 
  
  t = 1 #iteration number 1
  tmp = 0.1 #test for convergance
  
  while (tmp > 0.0001) {
    mme_new = mme(y, X, Z, A, sigma_a, sigma_e)
    C_new = ginv(mme_new$C)
    Ck = C_new[(p+1):(p+q), (p+1):(p+q)]
    mme2 = mme_new$est
    
    a = as.matrix(mme2[(p+1):(p+q)])
    sigma_a_new = (t(a)%*%ginv(A)%*%a + sum(diag(ginv(A)%*%Ck))*c(sigma_e))/q
    
    res = as.matrix(y-X%*%as.matrix(mme2[1:p]) - Z%*%as.matrix(mme2[(p+1):(p+q)]))
    X.tmp1 = cbind(X,Z) %*% C_new
    X.tmp2 = t(cbind(X,Z))
    sigma_e_new = (t(res)%*%res + sum(diag(X.tmp1%*%X.tmp2))*c(sigma_e))/n
    
    tmp = max(abs(sigma_a - sigma_a_new), abs(sigma_e - sigma_e_new))
    sigma_a = sigma_a_new
    sigma_e = sigma_e_new
    
    t = t + 1
  }
  list(t = t, sigma_a = sigma_a, sigma_e = sigma_e)
}

#### $\sigma^{2}_{\epsilon[t+1]} = \frac{\widehat{\epsilon}^{'}_{[t]}\widehat{\epsilon}_{[t]} + tr([X, Z]C^{22}_{[t]}[X, Z]^{'})\cdot\sigma^{2}_{\epsilon[t]}}{n}$

#### $\sigma^{2}_{a[t+1]} = \frac{\widehat{a}^{'}_{[t]}A^{-1}\widehat{a}_{[t]} + tr(A^{-1}C^{22}_{[t]})\cdot\sigma^{2}_{\epsilon[t]}}{q}$

In [12]:
(wyniki = EM(y, X, Z, A, sigma_a, sigma_e))
wyniki$sigma_a+wyniki$sigma_e
var(y)

0
0.6554876

0
0.01804659


0
0.6735342


0
0.678


7. Odziedziczalność

In [13]:
(h2 = 0.6555448 / (0.6555448 + 0.01799635))

In [14]:
mme(y, X, Z, A, wyniki$sigma_a, wyniki$sigma_e)$est

0
4.43770328
3.48232496
0.27627795
-0.1297969
-0.08639019
0.04800268
-0.57425926
0.40960934
-0.90249762
0.5413851


8. Test Walda z nowymi $\sigma^{2}_{a}$ i $\sigma^{2}_{\epsilon}$

In [15]:
G = A*c(wyniki$sigma_a)
R = diag(5)*c(wyniki$sigma_e)
V = Z%*%G%*%t(Z) + R
V

(varB = ginv(t(X)%*%ginv(V)%*%X))
(seB = sqrt(diag(varB)))

(testWalda = mme(y, X, Z, A, wyniki$sigma_a, wyniki$sigma_e)$est[1:2] / seB)
(p_value = 2*pnorm(abs(testWalda), lower.tail = FALSE))

0,1,2,3,4
0.67353423,0.0,0.1638719,0.3277438,0.08193596
0.0,0.6735342,0.1638719,0.3277438,0.24580787
0.16387191,0.1638719,0.6735342,0.1638719,0.32774382
0.32774382,0.3277438,0.1638719,0.6735342,0.16387191
0.08193596,0.2458079,0.3277438,0.1638719,0.67353423


0,1
0.342414,0.2026743
0.2026743,0.3633771


9. Dokładność oceny

In [16]:
C = as.matrix(mme(y, X, Z, A, wyniki$sigma_a, wyniki$sigma_e)$C)
invC = ginv(C)

invC22 = invC[3:10, 3:10]
alpha = wyniki$sigma_e / wyniki$sigma_a
(r2 = diag(1 - invC22*c(alpha)))
(r = sqrt(r2))

10. Macierz z genotypami

In [17]:
Genotypes = sample(c("AA", "AB", "BB"), 5000, replace = TRUE)
head(Genotypes)

In [18]:
Genotypes = as.data.frame(matrix(Genotypes, 5, 1000))

In [19]:
head(Genotypes)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V991,V992,V993,V994,V995,V996,V997,V998,V999,V1000
AA,AA,AA,BB,BB,BB,AB,BB,BB,AB,...,AB,AB,AA,AA,BB,BB,AB,BB,AB,BB
AA,BB,BB,BB,AA,BB,AB,BB,BB,BB,...,BB,AB,AB,BB,AA,BB,BB,AB,BB,AB
AB,AB,BB,AB,AA,AB,AA,BB,AB,BB,...,AA,AA,BB,AA,AA,BB,BB,BB,BB,BB
AA,AA,BB,AA,AA,AB,AA,AB,AB,BB,...,AB,BB,BB,AA,AB,AB,AA,AA,AB,AB
BB,BB,AA,AB,AA,BB,AA,AB,AA,AB,...,BB,BB,AB,AB,BB,AB,BB,AB,AA,AA


11. Macierz G

$\mathbf{G} = \frac{\textbf{MM}^{T}}{2\sum_{i=1}^{N_{SNP}}p_{i}(1-p_{i})}$,  gdzie $\textbf{M}_{ij}\in\{2-2p_{i}, 1-2p_{i}, -2p_{i}\}$ dla genotypu $SNP_{i} = \{AA, AB, BB\}$ i osobnika $j$.

In [20]:
tmp = data.frame(matrix(0, 3, 1))
colnames(tmp) = c("Var1")
tmp[, 1] = c("AA", "AB", "BB")
tmp = merge(tmp, as.data.frame(table(Genotypes[, 1])), by = "Var1", all.x = TRUE)
tmp[is.na(tmp)] = 0

In [21]:
tmp

Var1,Freq
AA,3
AB,1
BB,1


In [22]:
p = matrix(0, 3, 1000)
for (i in 1:1000) {
    tmp = data.frame(matrix(0, 3, 1))
    colnames(tmp) = c("Var1")
    tmp[, 1] = c("AA", "AB", "BB")
    tmp = merge(tmp, as.data.frame(table(Genotypes[, i])), by = "Var1", all.x = TRUE)
    tmp[is.na(tmp)] = 0
    p[, i] = tmp$Freq
}

In [23]:
p2 = numeric(1000)
for (i in 1:1000) {
    p2[i] = (2*p[1, i] + p[2, i]) / 10
}

In [24]:
head(p2)

In [25]:
(k = which(p2 == 0))
p2 = p2[-k]

In [26]:
(k2 = which(p2 == 1))
p2 = p2[-k2]

In [27]:
Genotypes2 = Genotypes[, -c(k, k2)]
dim(Genotypes2)

In [28]:
M = matrix(0, 5, length(p2))
for (i in 1:5) {
    for (j in 1:length(p2)) {
       if (Genotypes2[i, j] == "AA") {
           M[i, j] = 2 - 2*p2[j]
       } else if (Genotypes2[i, j] == "AB") {
           M[i, j] = 1 - 2*p2[j]
       } else {
           M[i, j] = - 2*p2[j]
       }
    }
}

In [29]:
G = M%*%t(M) / (2*sum(p2*(1-p2)))

In [30]:
G

0,1,2,3,4
1.2468381,-0.277407,-0.2755402,-0.2862743,-0.2792738
-0.277407,1.2249032,-0.3063425,-0.2144024,-0.2844075
-0.2755402,-0.3063425,1.2543053,-0.2918747,-0.2382041
-0.2862743,-0.2144024,-0.2918747,1.2748402,-0.3399449
-0.2792738,-0.2844075,-0.2382041,-0.3399449,1.2678396


12. Wrzucenie macierzy G do modelu

$\mathbf{y} = \mathbf{X\beta+Z_{1}a+Z_{2}g+\epsilon}$

In [32]:
mme2 = function(y, X, Z1, Z2, A, G, sigma_a, sigma_g, sigma_e) {
    alpha1 = sigma_e / sigma_a
    alpha2 = sigma_e / sigma_g
    invA = ginv(A)
    invG = ginv(G)
    C = rbind(cbind(t(X)%*%X, t(X)%*%Z1, t(X)%*%Z2),
              cbind(t(Z1)%*%X, t(Z1)%*%Z1+invA*c(alpha1), t(Z1)%*%Z2),
              cbind(t(Z2)%*%X, t(Z2)%*%Z1, t(Z2)%*%Z2 + invG*c(alpha2)))
    rhs = rbind(t(X)%*%y, t(Z1)%*%y, t(Z2)%*%y)
    invC = ginv(C)
    estimators = invC%*%rhs
    list(C = C, est = estimators)
}

(est = mme2(y, X, diag(5), diag(5), A[4:8, 4:8], G, 10, 30, 40)$est)

0
4.34130496
3.401214252
-0.003452949
-0.055133673
0.054219329
-0.072283409
0.054390036
0.087512029
-0.246672434
0.245158275


13. Efekty markerów SNP

$\widehat{SNP} = \left(\left(\frac{1-k}{N_{SNP}}\right)^{-1} + \frac{1}{k}\textbf{MA}^{-1}\textbf{M}^{T} \right)^{-1}\frac{1}{k}\textbf{MA}^{-1}\widehat{g}$

In [33]:
k = 0.2
snp = ginv((t(M)%*%ginv(A)[4:8, 4:8]%*%M)/k + 1 / ((1-k)/length(p2)) ) / k 

In [34]:
snp = snp%*%t(M)%*%ginv(A)[4:8, 4:8]

In [35]:
snp = snp %*% est[8:12]

In [36]:
head(snp)

0
-0.0013010329
-0.0006039244
0.0011944542
-0.0003928861
-0.0002694417
-0.0002621532


14. Czy efekty SNP są statystycznie istotne?

In [40]:
(s = sd(snp))
(m= mean(snp))

In [41]:
snp2 = snp / s
sd(snp2)

In [43]:
W = snp2
p.value = 2*pnorm(abs(W), lower.tail = FALSE)

In [45]:
length(which(p.value < 0.05))

15. Estymacja efektów markerów - metoda GBLUP

$\mathbf{y} = \mathbf{X\beta+Z_{1}a+Z_{2}g+\epsilon}$

$Z_{2} = \{-1, 0, 1\}$ odpowiednio dla genotypów $\{AA, AB, BB\}$

Efekt $\mathbf{g}\sim\mathcal{N}(0, \mathbf{I}\cdot\frac{\sigma^{2}_{a}}{N_{SNP}})$

In [46]:
Z2 = matrix(0, 5, length(p2))
for (i in 1:5) {
    for (j in 1:length(p2)) {
       if (Genotypes2[i, j] == "AA") {
           Z2[i, j] = -1
       } else if (Genotypes2[i, j] == "AB") {
           Z2[i, j] = 0
       } else {
           Z2[i, j] = 1
       }
    }
}

In [47]:
head(Z2)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
-1,-1,-1,1,1,1,0,1,1,0,...,0,0,-1,-1,1,1,0,1,0,1
-1,1,1,1,-1,1,0,1,1,1,...,1,0,0,1,-1,1,1,0,1,0
0,0,1,0,-1,0,-1,1,0,1,...,-1,-1,1,-1,-1,1,1,1,1,1
-1,-1,1,-1,-1,0,-1,0,0,1,...,0,1,1,-1,0,0,-1,-1,0,0
1,1,-1,0,-1,1,-1,0,-1,0,...,1,1,0,0,1,0,1,0,-1,-1


In [50]:
sigma_a = 0.6555448 
sigma_e = 0.01799635
sigma_g = sigma_a / length(p2)
G = diag(length(p2))

In [51]:
res2 = mme2(y, X, diag(5), Z2, A[4:8, 4:8], G, sigma_a, sigma_g, sigma_e)

In [52]:
dim(res2$C)

In [55]:
est_snp2 = res2$est[8:999]
head(est_snp2)

In [57]:
(s = sd(est_snp2))
(m= mean(est_snp2))

In [58]:
est_snp2 = est_snp2 / s
sd(est_snp2)

In [59]:
W2 = est_snp2
p.value2 = 2*pnorm(abs(W2), lower.tail = FALSE)

In [61]:
length(which(p.value2 < 0.05))

In [62]:
which(p.value < 0.05)
which(p.value2 < 0.05)

1. Wygenerować $10\ 000$ markerów na osobnika.
2. Włączyć markery do modelu na jeden z dwóch sposobów.
3. Wyestymować parametry wariancji (jeśli potrzebne).
4. Ocenić które markery SNP są statycznie istotne.

16. Model wielocechowy

Mamy $\textbf{y}_{1} = \textbf{X}_{1}\beta_{1}+\textbf{Z}_{1}\textbf{a}_{1}+\epsilon_{1}$ oraz $\textbf{y}_{2} = \textbf{X}_{2}\beta_{2}+\textbf{Z}_{2}\textbf{a}_{2}+\epsilon_{2}$

$ \left[ \begin{array}{c}
         \mathbf{y}_{1} \\
         \mathbf{y}_{2}\end{array}\right] = 
  \left[ \begin{array}{cc}
         \mathbf{X}_{1} & \mathbf{0} \\
         \mathbf{0} & \mathbf{X}_{2} \end{array}\right]\cdot
  \left[ \begin{array}{c}
         \mathbf{\beta}_{1} \\
         \mathbf{\beta}_{2} \end{array}\right] +
  \left[ \begin{array}{cc}
         \mathbf{Z}_{1} & \mathbf{0} \\
         \mathbf{0} & \mathbf{Z}_{2} \end{array}\right]\cdot
  \left[ \begin{array}{c}
         \mathbf{a}_{1} \\
         \mathbf{a}_{2} \end{array}\right] +
  \left[ \begin{array}{c}
         \mathbf{\epsilon}_{1} \\
         \mathbf{\epsilon}_{2}\end{array}\right]$

Korelacja pomiedzy efektami

$var\left[ \begin{array}{c}
         \mathbf{a}_{1} \\
         \mathbf{a}_{2} \\ 
         \mathbf{\epsilon}_{1} \\
         \mathbf{\epsilon}_{2}
         \end{array}\right] = 
    \left[ \begin{array}{cccc}
         \mathbf{g}_{11}\mathbf{A} & \mathbf{g}_{12}\mathbf{A} & \mathbf{0} & \mathbf{0} \\
         \mathbf{g}_{21}\mathbf{A} & \mathbf{g}_{22}\mathbf{A} & \mathbf{0} & \mathbf{0} \\ 
         \mathbf{0} & \mathbf{0} & \mathbf{r}_{11}\mathbf{I} & \mathbf{r}_{12}\mathbf{I} \\
         \mathbf{0} & \mathbf{0} & \mathbf{r}_{21}\mathbf{I} & \mathbf{r}_{22}\mathbf{I}
         \end{array}\right] $

Układ równań mieszanych

$ \left[ \begin{array}{cc}
         \mathbf{X}^{T}\mathbf{R}^{-1}\mathbf{X} & \mathbf{X}^{T}\mathbf{R}^{-1}\mathbf{Z} \\
         \mathbf{Z}^{T}\mathbf{R}^{-1}\mathbf{X} & \mathbf{Z}^{T}\mathbf{R}^{-1}\mathbf{Z}+\mathbf{A}^{-1}\otimes \mathbf{G}^{-1} \end{array}\right]\cdot
  \left[ \begin{array}{c}
         \widehat{\mathbf{\beta}} \\
         \widehat{\mathbf{a}} \end{array}\right]=
  \left[ \begin{array}{c}
         \mathbf{X}^{T}\mathbf{R}^{-1}\mathbf{y} \\
         \mathbf{Z}^{T}\mathbf{R}^{-1}\mathbf{y} \end{array}\right]$, gdzie

$ \mathbf{y} = \left[ \begin{array}{c}
         \mathbf{y}_{1} \\
         \mathbf{y}_{2}\end{array}\right]$, $ \mathbf{X} = \left[ \begin{array}{cc}
         \mathbf{X}_{1} & \mathbf{0} \\
         \mathbf{0} & \mathbf{X}_{2} \end{array}\right]$, 
$ \mathbf{\beta} = \left[ \begin{array}{c}
         \mathbf{\beta}_{1} \\
         \mathbf{\beta}_{2} \end{array}\right]$, 
$ \mathbf{Z} =  \left[ \begin{array}{cc}
         \mathbf{Z}_{1} & \mathbf{0} \\
         \mathbf{0} & \mathbf{Z}_{2} \end{array}\right]$,
$ \mathbf{a} = \left[ \begin{array}{c}
         \mathbf{a}_{1} \\
         \mathbf{a}_{2} \end{array}\right]$,
$ \mathbf{\epsilon} = \left[ \begin{array}{c}
         \mathbf{\epsilon}_{1} \\
         \mathbf{\epsilon}_{2}\end{array}\right]$

Macierz A

In [4]:
library(pedigreemm)

id = 1:8
sire = c(NA, NA, NA, 1, 3, 1, 4, 3)
dam = c(NA, NA, NA, NA, 2, 2, 5, 6)

cbind(id, sire, dam)

id,sire,dam
1,,
2,,
3,,
4,1.0,
5,3.0,2.0
6,1.0,2.0
7,4.0,5.0
8,3.0,6.0


In [5]:
(ped = pedigree(sire = sire, dam = dam, label = id))
(A = as.matrix(getA(ped)))

  sire  dam
1 <NA> <NA>
2 <NA> <NA>
3 <NA> <NA>
4    1 <NA>
5    3    2
6    1    2
7    4    5
8    3    6

1,2,3,4,5,6,7,8
1.0,0.0,0.0,0.5,0.0,0.5,0.25,0.25
0.0,1.0,0.0,0.0,0.5,0.5,0.25,0.25
0.0,0.0,1.0,0.0,0.5,0.0,0.25,0.5
0.5,0.0,0.0,1.0,0.0,0.25,0.5,0.125
0.0,0.5,0.5,0.0,1.0,0.25,0.5,0.375
0.5,0.5,0.0,0.25,0.25,1.0,0.25,0.5
0.25,0.25,0.25,0.5,0.5,0.25,1.0,0.25
0.25,0.25,0.5,0.125,0.375,0.5,0.25,1.0


Pozostałe źródła informacji

In [15]:
y1 = as.matrix(c(4.5, 2.9, 3.9, 3.5, 5.0))
y2 = as.matrix(c(6.8, 5.0, 6.8, 6.0, 7.5))

(G = matrix(c(20, 18, 18, 40), 2, 2))
(R = matrix(c(40, 11, 11, 30), 2, 2))

0,1
20,18
18,40


0,1
40,11
11,30


In [9]:
sex = c(1, 0, 0, 1, 1)
X1 = matrix(0, 5, 2)
X1[, 1] = sex
X1[, 2] = 1-sex
X2 = X1

In [10]:
I = diag(5)
Z1 = matrix(0, 5, 8)
Z1[1:5, 4:8] = I
Z2 = Z1

In [35]:
library(MASS)

mme3 = function(y, X, Z, A, G, R) {
    invA = ginv(A)
    invG = ginv(G)
    R = kronecker(R, diag(5))
    invR = ginv(R)
    C = rbind(cbind(t(X)%*%invR%*%X, t(X)%*%invR%*%Z),
              cbind(t(Z)%*%invR%*%X, t(Z)%*%invR%*%Z+kronecker(invG, invA)))
    rhs = rbind(t(X)%*%invR%*%y, t(Z)%*%invR%*%y)
    invC = ginv(C)
    estimators = invC%*%rhs
    list(C = C, est = estimators)
}

In [20]:
y = as.matrix(c(y1, y2))

X = matrix(0, 10, 4)
X[1:5, 1:2] = X1
X[6:10, 3:4] = X2

Z = matrix(0, 10, 16)
Z[1:5, 1:8] = Z1
Z[6:10, 9:16] = Z2
Z

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1


In [36]:
results = mme3(y, X, Z, A, G, R)

In [37]:
results

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0.08341057,0.0,-0.03058387,0.0,0.0,0.0,0.0,0.02780352,0.0,0.0,0.02780352,0.02780352,0.0,0.0,0.0,-0.01019462,0.0,0.0,-0.01019462,-0.01019462
0.0,0.05560704,0.0,-0.02038925,0.0,0.0,0.0,0.0,0.02780352,0.02780352,0.0,0.0,0.0,0.0,0.0,0.0,-0.01019462,-0.01019462,0.0,0.0
-0.03058387,0.0,0.11121409,0.0,0.0,0.0,0.0,-0.01019462,0.0,0.0,-0.01019462,-0.01019462,0.0,0.0,0.0,0.03707136,0.0,0.0,0.03707136,0.03707136
0.0,-0.02038925,0.0,0.07414272,0.0,0.0,0.0,0.0,-0.01019462,-0.01019462,0.0,0.0,0.0,0.0,0.0,0.0,0.03707136,0.03707136,0.0,0.0
0.0,0.0,0.0,0.0,0.1540616,0.04201681,2.145809e-16,-0.05602241,4.664803e-17,-0.08403361,-4.01173e-16,-1.810527e-16,-0.06932773,-0.01890756,-9.656141e-17,0.02521008,-2.0991610000000003e-17,0.03781513,1.805279e-16,8.147369000000001e-17
0.0,0.0,0.0,0.0,0.04201681,0.1680672,0.04201681,-1.119553e-16,-0.08403361,-0.08403361,7.463684e-17,1.2245110000000001e-17,-0.01890756,-0.07563025,-0.01890756,5.0379870000000003e-17,0.03781513,0.03781513,-3.358658e-17,-5.510298e-18
0.0,0.0,0.0,0.0,6.997204000000001e-17,0.04201681,0.1680672,4.1983220000000006e-17,-0.08403361,0.04201681,-3.2653620000000006e-17,-0.08403361,-3.1487420000000003e-17,-0.01890756,-0.07563025,-1.8892450000000003e-17,0.03781513,-0.01890756,1.469413e-17,0.03781513
0.02780352,0.0,-0.01019462,0.0,-0.05602241,-1.119553e-16,-1.819273e-16,0.1818651,0.04201681,8.396645000000001e-17,-0.08403361,8.046785000000001e-17,0.02521008,5.0379870000000003e-17,8.186729000000001e-17,-0.07952236,-0.01890756,-3.7784900000000006e-17,0.03781513,-3.6210530000000004e-17
0.0,0.02780352,0.0,-0.01019462,-9.329605e-18,-0.08403361,-0.08403361,0.04201681,0.2378876,3.731842e-17,-0.08403361,-1.778456e-16,4.198322e-18,0.03781513,0.03781513,-0.01890756,-0.1047324,-1.679329e-17,0.03781513,8.003052000000001e-17
0.0,0.02780352,0.0,-0.01019462,-0.08403361,-0.08403361,0.04201681,-2.332401e-17,-7.463684e-17,0.2378876,5.1312830000000004e-17,-0.08403361,0.03781513,0.03781513,-0.01890756,1.049581e-17,3.358658e-17,-0.1047324,-2.3090770000000003e-17,0.03781513

0
4.360866999
3.397261592
6.79989762
5.880295937
0.150915567
-0.01539251
-0.078391896
-0.010238959
-0.270331441
0.275808258


Dokładność oceny wielocechowej

In [48]:
C = as.matrix(mme3(y, X, Z, A, G, R)$C)
invC = ginv(C)

invC22 = invC[5:20, 5:20]

trait1 = diag(invC22)[1:8]
trait2 = diag(invC22)[9:16]

(r2_1 = (20-trait1) / 20)
(r2_2 = (40-trait2) / 40)

(r1 = sqrt(r2_1))
(r2 = sqrt(r2_2))

Podpowiedź do projektu

0. Określamy ilość analizowanych danych (w zależności od wydajności komputera i optymalności kodu).

* Możemy wyrzucić z analizy trochę (nawet dużo) wierszy, ale należy to zrobić tak, aby pozostawione osobniki były ze sobą spokrewnione.
* Na podsawie MAF (frekwencji rzadszego allelu) selekcje SNP (Wyznaczamy frekwencję allelu A dla każdego z marekrów jeżeli jest ona < 0.5 to ją zostawiamy, a wprzeciwnym wypadku obliczamy 1-0.5; wyrzucamy z anlizy markeru, które mają MAF < 5%).
* Obliczamy niezbędne źródła informacji (macierz $A$ oraz decydujemy, czy dołączamy efekty markerów SNP i ewentualnie obliczamy macierz $G$ - jeżeli uzywamy macierzy $G$ to do estymacji efektów SNP używamy metody backsolve).
* Możemy również zdecydować się na macierz $H$, która jest połączeniem macierzy $A$ oraz macierzy $G$.
* Na podstawie danych decydujemy jakie efekty stałe będą występowały w modelu.

1. Wybieramy ilość analizowanych cech (jedna lub dwie)

* Czy wybieramy model wielo-, czy jednocechowy?
* Czy estymujemy parametry wariancji, czy uznajemy je za znane? W przypadku braku estymacji opieramy się na wariancji $y$.

2. Na podstawie modelu:

* estymujemy wartości hodowlane wraz z ich dokładnościami i ewentualnie porównujemy je pomiędzy modelem jedno- lub dwucechowym;
* wyestymować efekty markerów SNP oraz określić, które z markerów są statystycznie istotne;
* określić istotność efektów stałych w modelu;
* określić odziedziczalność każdej z cech na podstawie odpowiedniego modelu.


In [None]:
dane = read.table("http://theta.edu.pl/wp-content/uploads/2022/05/daneProjekt.csv", header = TRUE, sep = ";")

In [None]:
daneG = dane[, 8:173]
