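# GWAS Plaintext Implementation

A genome-wide association study (GWAS) takes genotype samples from cases and controls and builds a model of how strongly each genotype is associated with a disease.

Once fitted, the model can be used to estimate how likely a carrier of a given genotype is to develop the disease.

The GWAS pipeline covered here consists of the following algorithms:

- Data preprocessing:
    - SNP filtering

- Association analysis:
    - PCA + logistic regression
    - Odds ratio
    - Chi-square test for trend

The test data has 126 rows × 24 columns: column 1 is the phenotype (y) and the other 23 columns are SNP loci (x), for 126 samples in total.

- *In the first column (y), 1/0 marks case/control. Every other entry is the number of copies of one allele (e.g. a) at that locus; 0 means either zero copies or a missing value.*

- *Note: in the analysis below, the first 120 samples are used to fit the models and the remaining 6 samples are held out for testing.*

- *The test data is assumed to contain no missing values, so 0 denotes genotype AA.*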

In [1]:
import numpy as np
import pandas as pd
from sklearn import decomposition

import warnings
warnings.filterwarnings('ignore')

# Load the data
data_file = './data/pca-data.txt'
data = pd.read_csv(data_file, sep='\t', index_col=False)
data.set_index('Unnamed: 0', inplace=True)
data.head()  # first 5 rows

Unnamed: 0,Case(1)/Control(0),rs11252546,rs7909677,rs10904494,rs11591988,rs4508132,rs10904561,rs7917054,rs7906287,rs4495823,...,rs9419560,rs9419561,rs11253562,rs4881551,rs4880750,rs11594819,rs9419419,rs7909028,rs7476951,rs12146291
NA18532,1,0,0,0,1,1,1,0,1,1,...,0,0,0,1,1,0,0,0,0,1
NA18605,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
NA18542,1,0,0,0,2,2,2,0,2,1,...,0,0,0,1,1,1,0,1,1,1
NA18550,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
NA18608,1,0,1,0,1,1,0,1,1,1,...,0,0,0,1,0,2,0,0,0,0


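## 1. SNP Filtering

- **Minor allele frequency (MAF)**

| SNP | Allele A | Allele a |
|:---:|:---:|:---:|
|Case group|$a$|$b$|
|Control group|$c$|$d$|

With $N$ samples there are $2N$ alleles in total, so
$A(\%) = \frac{a+c}{2N}$, $a(\%) = \frac{b+d}{2N}$

The MAF is the smaller of these two frequencies.

- **Hardy-Weinberg equilibrium (HWE) test**

|        |   AA   |   Aa   |   aa   |
|:-:|:-:|:-:|:-:|
|Observed count| $h$ | $i$ | $j$ |
|Expected count| $MP^2(A)=k$ | $2MP(A)P(a)=l$ | $MP^2(a)=m$ |

*where $M$ is the total number of samples*

$T = \frac{(h-k)^2}{k} + \frac{(i-l)^2}{l} + \frac{(j-m)^2}{m}$

Taking SNP1 (`rs11252546`) from the data above as an example, compute the MAF: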

In [2]:
case_list = data.loc[data.iloc[:, 0] == 1].iloc[:, 1]
control_list = data.loc[data.iloc[:, 0] == 0].iloc[:, 1]
print("case samples: {}, control samples: {}".format(case_list.shape[0], control_list.shape[0]))

# Count of A: 2*(number of AA) + 1*(number of Aa)
# Count of a: 2*(number of aa) + 1*(number of Aa), or equivalently 2*(number of samples) - (count of A)
case_A = (case_list == 0).sum() * 2 + (case_list == 1).sum()
case_a = 2 * case_list.shape[0] - case_A

control_A = (control_list == 0).sum() * 2 + (control_list == 1).sum()
control_a = 2 * control_list.shape[0] - control_A

print(
"""           A      a
Case       {}    {}
Control    {}    {}
""".format(case_A, case_a, control_A, control_a))

# Allele frequencies A(%) and a(%):
rate_A = (case_A + control_A) / (2*data.shape[0])
rate_a = (case_a + control_a) / (2*data.shape[0])
print("A(%): {}\na(%): {}".format(rate_A, rate_a))

case samples: 62, control samples: 64
           A      a
Case       120    4
Control    123    5

A(%): 0.9642857142857143
a(%): 0.03571428571428571
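The same calculation can be applied to every SNP column to filter out rare variants. The sketch below assumes the 0/1/2 coding used above (number of copies of allele a, no missing values). The function names `minor_allele_frequency` and `filter_by_maf`, the toy genotype lists, and the 0.05 threshold are illustrative assumptions, not part of the original analysis.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP; each genotype is the number of copies of allele a (0, 1 or 2)."""
    n_alleles = 2 * len(genotypes)            # two alleles per sample
    freq_a = sum(genotypes) / n_alleles       # frequency of allele a
    return min(freq_a, 1.0 - freq_a)          # the rarer allele defines the MAF


def filter_by_maf(snps, threshold=0.05):
    """Keep the SNPs whose MAF is at least `threshold` (an assumed cut-off)."""
    return {name: g for name, g in snps.items()
            if minor_allele_frequency(g) >= threshold}


# Toy data: three hypothetical SNPs genotyped in 10 samples
toy_snps = {
    "snp_common": [0, 1, 2, 1, 0, 1, 0, 2, 1, 0],   # freq_a = 8/20
    "snp_rare":   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],   # freq_a = 0
    "snp_major":  [2, 2, 2, 2, 2, 2, 2, 2, 1, 1],   # freq_a = 18/20, so A is the minor allele
}
kept = filter_by_maf(toy_snps)
print(sorted(kept))
```

For `rs11252546` (117 samples with 0 copies and 9 samples with 1 copy), `minor_allele_frequency` gives 9/252 ≈ 0.0357, so this SNP would be dropped at a 0.05 threshold.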


Taking SNP1 (`rs11252546`) as the example again, run the HWE test:

In [3]:
# Chi-square statistic: sum over genotypes of (observed - expected)^2 / expected
CA = lambda h,i,j,k,l,m: (h-k)**2/k + (i-l)**2/l + (j-m)**2/m

# Genotype coding (see the assumption above): 0 = AA, 1 = Aa, 2 = aa
h_count = (data.iloc[:, 1] == 0).sum()  # AA
i_count = (data.iloc[:, 1] == 1).sum()  # Aa
j_count = (data.iloc[:, 1] == 2).sum()  # aa

# The frequencies of A and a were computed above; now compute the expected counts
k_count = len(data) * rate_A ** 2
l_count = 2 * len(data) * rate_A * rate_a  # heterozygotes: 2 * M * P(A) * P(a)
m_count = len(data) * rate_a ** 2
print('''
Observed:\t{}\t{}\t{}
Expected:\t{:.4f}\t{:.4f}\t{:.4f}
'''.format(h_count, i_count, j_count, k_count, l_count, m_count))

ca_val = CA(h_count, i_count, j_count, k_count, l_count, m_count)
print("ca_val: {:.4f}".format(ca_val))


Observed:	117	9	0
Expected:	117.1607	8.6786	0.1607

ca_val: 0.1728
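To turn the HWE statistic into a decision, it can be compared against a chi-square distribution. The sketch below is a minimal standalone version that takes the three genotype counts directly. It assumes the expected heterozygote count is $2MP(A)P(a)$, keeps the expected counts as floats, and uses 1 degree of freedom (three genotype classes, one estimated allele frequency). The function name `hwe_chi_square` is hypothetical, and the counts 117/9/0 are the AA/Aa/aa counts of `rs11252546` under the 0 = AA coding.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square statistic and p-value (1 degree of freedom) for the HWE test."""
    m = n_AA + n_Aa + n_aa                      # number of samples
    p_A = (2 * n_AA + n_Aa) / (2 * m)           # frequency of allele A
    p_a = 1.0 - p_A                             # frequency of allele a
    expected = (m * p_A ** 2, 2 * m * p_A * p_a, m * p_a ** 2)
    observed = (n_AA, n_Aa, n_aa)
    # Skip genotype classes whose expected count is 0 to avoid dividing by zero
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # For 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p_value = hwe_chi_square(117, 9, 0)
print("chi-square = {:.4f}, p-value = {:.4f}".format(stat, p_value))
```

A large p-value (as here) gives no evidence that the SNP deviates from Hardy-Weinberg equilibrium; SNPs with very small p-values are the ones typically filtered out.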


---

## 2.1.1 PCA

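PCA steps (given $m$ samples of $n$-dimensional data):

1. Arrange the raw data column by column into a matrix $X$ with $n$ rows and $m$ columns
2. Zero-center each row of $X$ (each row is one attribute) by subtracting that row's mean
3. Compute the covariance matrix
4. Compute the eigenvalues of the covariance matrix and their corresponding eigenvectors $r$
5. Stack the eigenvectors as rows, ordered from top to bottom by decreasing eigenvalue, and take the first $k$ rows as the matrix $P$
6. $Y=PX$ is the data reduced to $k$ dimensions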
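### Example:

*The test data has 126 samples with 23 features each. The data matrix therefore has m=126 rows and n=23 columns, one sample per row, i.e. it is the transpose of the $X$ described in the steps above.*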

In [4]:
data.head() # first 5 samples; the first column is the case/control label and must be dropped before running PCA

Unnamed: 0,Case(1)/Control(0),rs11252546,rs7909677,rs10904494,rs11591988,rs4508132,rs10904561,rs7917054,rs7906287,rs4495823,...,rs9419560,rs9419561,rs11253562,rs4881551,rs4880750,rs11594819,rs9419419,rs7909028,rs7476951,rs12146291
NA18532,1,0,0,0,1,1,1,0,1,1,...,0,0,0,1,1,0,0,0,0,1
NA18605,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
NA18542,1,0,0,0,2,2,2,0,2,1,...,0,0,0,1,1,1,0,1,1,1
NA18550,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
NA18608,1,0,1,0,1,1,0,1,1,1,...,0,0,0,1,0,2,0,0,0,0


---
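- Use the PCA implementation from sklearn, where:
  - `n_components=5` keeps the 5 largest principal components (i.e. the reduced dimensionality is 5).
  - This implementation expects one sample per row by default, so `pca.fit()` can be called on the data directly.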

In [5]:
pca = decomposition.PCA(n_components=5)
pca.fit(data.iloc[:120, 1:])  # one sample per row
ex_ratio = pca.explained_variance_ratio_  # fraction of the variance explained by each component

print("Explained variance ratio (ex_ratio): {},\nTotal explained variance ratio (fit_ratio): {}\n".format(ex_ratio, sum(ex_ratio)))
print("Principal components (components): shape: {}".format(pca.components_.shape))
pd.DataFrame(pca.components_)

Explained variance ratio (ex_ratio): [0.68002706 0.13521191 0.05426511 0.03935531 0.02754393],
Total explained variance ratio (fit_ratio): 0.9364033069486133

Principal components (components): shape: (5, 23)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
0,0.013274,0.004536,0.013274,0.332443,0.348467,0.292939,0.033153,0.326093,0.33816,0.287632,...,0.007677,0.007677,0.013274,0.324938,0.305059,0.042437,0.001485,0.031691,0.023976,0.250886
1,0.20279,0.12472,0.20279,0.171857,0.126092,-0.227521,0.389545,0.162024,0.115267,-0.260079,...,0.05855,0.05855,0.20279,-0.083299,-0.270054,0.53725,0.06807,0.186421,0.191301,-0.171149
2,-0.137876,0.230119,-0.137876,0.108235,0.086796,-0.048939,0.187667,0.138728,0.077292,-0.156735,...,0.128137,0.128137,-0.137876,0.183684,-0.141859,-0.614671,0.130763,0.107377,0.098424,-0.482468
3,-0.214839,0.105593,-0.214839,-0.198353,0.051258,-0.157663,-0.085525,-0.243187,0.085847,0.272553,...,0.019864,0.019864,-0.214839,0.324755,0.19544,0.475642,0.029806,-0.046433,-0.119006,-0.493429
4,-0.154834,0.287157,-0.154834,-0.08228,-0.020966,-0.358477,0.22209,-0.136386,0.069448,-0.218078,...,0.112256,0.112256,-0.154834,0.302955,-0.073969,0.015418,0.110964,-0.18073,-0.187196,0.601401


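As shown above, the principal components form a $5\times23$ matrix ($U$). To reduce the dimensionality, compute $XU^T$ (after centering $X$ with the training mean if the result should match `pca.transform()`).
For example, to reduce the test data (6 samples):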

In [6]:
data.iloc[-6:, 1:] # the input has shape n×23, here n=6

Unnamed: 0,rs11252546,rs7909677,rs10904494,rs11591988,rs4508132,rs10904561,rs7917054,rs7906287,rs4495823,rs2379076,...,rs9419560,rs9419561,rs11253562,rs4881551,rs4880750,rs11594819,rs9419419,rs7909028,rs7476951,rs12146291
NA18642,0,0,0,2,2,2,0,2,2,1,...,0,0,0,2,2,0,0,0,0,2
NA18599,0,0,0,1,1,1,0,1,1,1,...,0,0,0,1,1,0,0,0,0,1
NA18626,0,1,0,2,2,1,1,2,2,1,...,0,0,0,2,1,1,0,0,0,1
NA18595,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
NA18618,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
NA18740,0,0,0,1,1,1,0,1,1,1,...,0,0,0,1,1,0,0,0,0,1


In [7]:
out_x = np.dot(data.iloc[-6:, 1:], pca.components_.T)  # the output has shape n×5, here n=6
# pd.DataFrame(pca.transform(data.iloc[-6:, 1:]))
pd.DataFrame(out_x.astype(np.float16))

Unnamed: 0,0,1,2,3,4
0,5.996094,-0.349365,-0.069519,-0.483398,0.385742
1,3.140625,-0.304688,-0.113159,-0.105408,0.083862
2,5.226562,1.371094,0.406738,0.468018,0.741699
3,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0
5,3.140625,-0.304688,-0.113159,-0.105408,0.083862


The table above is the resulting 5-dimensional representation of the 6 test samples.
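sklearn's `PCA.transform` subtracts the training mean before projecting, whereas the plain product $XU^T$ does not. The sketch below uses NumPy only, with made-up genotype data and a component matrix `U` obtained from an SVD (all names here are assumptions for illustration), to show that the two projections differ by the same constant row vector for every sample.

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.integers(0, 3, size=(20, 4)).astype(float)   # made-up genotypes: 20 samples x 4 SNPs
X_test = rng.integers(0, 3, size=(3, 4)).astype(float)     # 3 made-up test samples

mean = X_train.mean(axis=0)                                # training mean of each SNP
# Principal directions of the centered training data, taken from the SVD
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
U = Vt[:2]                                                 # top-2 components, shape (2, 4)

projected_raw = X_test @ U.T                               # plain product, no centering
projected_centered = (X_test - mean) @ U.T                 # projection after centering

offset = projected_raw - projected_centered                # identical for every sample
print(np.allclose(offset, mean @ U.T))
```

Because the offset is constant, it does not change distances between samples, but it does matter when a model fitted on centered projections is applied to uncentered ones.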

---
- Below is a hand-written PCA (based on numpy) that follows the PCA steps described above:

In [8]:
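def my_pca(data_mat, topNfeat=9999999):
    data_mat = np.asarray(data_mat, dtype=float)          # accept DataFrames as well as arrays
    mean_vals = np.mean(data_mat, axis=0)                 # per-feature mean
    mean_removed = data_mat - mean_vals                   # 2. zero-center

    cov_mat = np.cov(mean_removed, rowvar=0)              # 3. covariance matrix (rowvar=0: each column is a feature)

    eig_vals, eig_vects = np.linalg.eig(cov_mat)          # 4. eigenvalues & eigenvectors

    eig_val_ind = np.argsort(eig_vals)                    # indices sorted by ascending eigenvalue
    eig_val_ind = eig_val_ind[:-(topNfeat+1):-1]          # 5. keep the topNfeat largest, in descending order
    red_eig_vects = eig_vects[:, eig_val_ind]

    low_d_data_mat = np.dot(mean_removed, red_eig_vects)  # 6. project the data into the new space
    # Reconstruct the data from the reduced representation
    recon_mat = np.dot(low_d_data_mat, red_eig_vects.T) + mean_vals

    # Return the reduced data, the reconstruction, and the principal components
    return low_d_data_mat, recon_mat, red_eig_vects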

In [9]:
low_d_data_mat, recon_mat, red_eig_vects = my_pca(data.iloc[:120, 1:], topNfeat=5)

print("主成分： shape: {}".format(red_eig_vects.shape))
pd.DataFrame(red_eig_vects.astype(np.float16))

主成分： shape: (23, 5)


Unnamed: 0,0,1,2,3,4
0,-0.013275,0.202759,-0.137817,-0.214844,-0.154785
1,-0.004536,0.124695,0.230103,0.105591,0.287109
2,-0.013275,0.202759,-0.137817,-0.214844,-0.154785
3,-0.33252,0.171875,0.108215,-0.198364,-0.082275
4,-0.348389,0.126099,0.086792,0.05127,-0.020966
5,-0.292969,-0.227539,-0.04895,-0.157715,-0.358398
6,-0.033142,0.389648,0.187622,-0.08551,0.222046
7,-0.326172,0.161987,0.138672,-0.243164,-0.136353
8,-0.338135,0.115295,0.077271,0.085876,0.069458
9,-0.287598,-0.26001,-0.156738,0.272461,-0.218018


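As shown, the principal-component matrix $U$ has shape ($23\times5$), so $XU$ gives the reduced data $((120\times23)\times(23\times5))$. The first component has the opposite sign to the sklearn result because an eigenvector is only defined up to its sign.

*Note: different PCA implementations may follow different conventions (for example the orientation of the component matrix or the sign of each component), so verify the result before relying on a particular implementation*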
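One simple way to run such a check is to align the signs of two component matrices before comparing them. In the outputs above, the first component of `my_pca` is the negative of the first sklearn component while the others agree. The helper below is a hypothetical sketch (the name `align_signs` and the two small matrices are made up for illustration) and expects one component per column.

```python
import numpy as np


def align_signs(reference, components):
    """Flip each column of `components` so it points the same way as `reference`."""
    reference = np.asarray(reference, dtype=float)
    components = np.asarray(components, dtype=float)
    dots = np.sum(reference * components, axis=0)      # column-wise dot products
    signs = np.where(dots < 0, -1.0, 1.0)              # -1 where the directions are opposite
    return components * signs


# Two hypothetical results for the same 2 components (one column per component);
# the first column differs only by its sign
a = np.array([[0.6, 0.8], [0.8, -0.6]])
b = np.array([[-0.6, 0.8], [-0.8, -0.6]])
print(np.allclose(align_signs(a, b), a))
```

After alignment, `np.allclose` can be used to confirm that two implementations found the same directions.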

Next, compute the reduced data and the reconstruction:

In [10]:
# PCA-reduced test data
# low_d_data_mat = pd.DataFrame(low_d_data_mat)
test_low_d_data_mat = np.dot(data.iloc[-6:, 1:], red_eig_vects)
pd.DataFrame(test_low_d_data_mat.astype(np.float16))

Unnamed: 0,0,1,2,3,4
0,-5.996094,-0.349365,-0.069519,-0.483398,0.385742
1,-3.140625,-0.304688,-0.113159,-0.105408,0.083862
2,-5.226562,1.371094,0.406738,0.468018,0.741699
3,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0
5,-3.140625,-0.304688,-0.113159,-0.105408,0.083862


In [11]:
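# Reconstructed test data (all 6 rows)
# NOTE: the projection above was not centered, so adding a mean here shifts the result;
# centering with the training mean before projecting and adding it back would be consistent
test_mean_vals = np.mean(data.iloc[-6:, 1:], axis=0).to_numpy()
recon_mat = np.dot(test_low_d_data_mat, red_eig_vects.T) + test_mean_vals
pd.DataFrame(recon_mat.astype(np.float32))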

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,13,14,15,16,17,18,19,20,21,22
0,0.062428,0.194033,0.062428,2.989506,3.006034,2.610252,0.343298,2.95355,2.966857,2.276832,...,0.050367,0.050367,0.062428,2.924172,2.643302,0.052109,0.004434,0.070097,0.055355,2.90127
1,0.005163,0.129829,0.005163,1.993674,2.039204,1.814946,0.158516,1.973462,2.015141,1.62014,...,-0.000901,-0.000901,0.005163,2.016477,1.863124,0.156935,-0.024706,0.020328,0.00273,1.83063
2,0.075922,0.71732,0.075922,2.863162,3.037712,1.69284,1.074957,2.767798,3.048376,1.715373,...,0.265057,0.265057,0.075922,3.035308,2.036273,1.108858,0.250509,0.309113,0.233088,1.928639
3,0.0,0.166667,0.0,1.0,1.0,0.833333,0.166667,1.0,1.0,0.666667,...,0.0,0.0,0.0,1.0,0.833333,0.166667,0.0,0.0,0.0,0.833333
4,0.0,0.166667,0.0,1.0,1.0,0.833333,0.166667,1.0,1.0,0.666667,...,0.0,0.0,0.0,1.0,0.833333,0.166667,0.0,0.0,0.0,0.833333
5,0.005163,0.129829,0.005163,1.993674,2.039204,1.814946,0.158516,1.973462,2.015141,1.62014,...,-0.000901,-0.000901,0.005163,2.016477,1.863124,0.156935,-0.024706,0.020328,0.00273,1.83063


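## 2.1.2 Logistic Regression (to be completed)

The PCA above mainly prepares the data for logistic regression. When the input has very many dimensions (millions of SNPs), the parameters $\theta$ of the linear term $\theta^TX$ become hard to estimate, so reducing the dimensionality before the logistic regression is necessary.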
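To make the role of $\theta$ concrete, here is a minimal sketch of logistic regression fitted by plain gradient descent on the mean log-loss. It is only meant to illustrate the model $P(y=1)=\sigma(\theta^TX)$: it has no regularization, so it is a simpler procedure than the L1-penalized sklearn estimator used in this notebook, and the function names, learning rate, iteration count, and toy data are all assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Fit theta by gradient descent on the mean log-loss (no regularization)."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])        # prepend an intercept column
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)  # gradient of the mean log-loss
        theta -= lr * grad
    return theta


def predict(theta, X):
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(X @ theta) >= 0.5).astype(int)      # threshold the probability at 0.5


# Toy data: one feature, label 1 when the feature is large
X = np.array([[0.0], [0.5], [1.0], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic(X, y)
print(predict(theta, X))
```

Each feature adds one entry to `theta`, which is why reducing 23 SNP columns to 5 principal components shrinks the problem the optimizer has to solve.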

In [12]:
# Logistic regression with sklearn
from sklearn.linear_model import LogisticRegression
LR = LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=0.01)  # the l1 penalty needs a solver that supports it

# Use the raw data (23 features); the label is the phenotype in column 0
LR.fit(data.iloc[:120, 1:], data.iloc[:120, 0])
predict = LR.predict(data.iloc[120:, 1:])
ground_truth = data.iloc[120:, 0].tolist()
print("Predicted: {}\nActual: {}".format(predict, ground_truth))

Predicted: [0 0 0 0 0 0]
Actual: [0, 0, 0, 0, 0, 0]


In [13]:
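# Use the reduced data (5 features); the label is the phenotype in column 0
LR.fit(low_d_data_mat, data.iloc[:120, 0])
predict = LR.predict(test_low_d_data_mat)  # prediction must also use data reduced with the same components
ground_truth = data.iloc[120:, 0].tolist()
print("Predicted: {}\nActual: {}".format(predict, ground_truth))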

Predicted: [0 0 0 0 0 0]
Actual: [0, 0, 0, 0, 0, 0]


---

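## 2.2 Odds Ratio

|SNP      | Allele A | Allele a |
|:-------:|:----:|:-----:|
|Case group   |a     |b      |
|Control group|c     |d      |

For the minor allele (a), the odds ratio (OR) is: $T = \frac{bc}{ad}$

The OR is usually computed for the minor allele:
- OR = 1: the factor has no effect on the occurrence of the disease.
- OR > 1: the factor is a risk factor.
- OR < 1: the factor is a protective factor.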
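Whether an OR really differs from 1 depends on the sample size, so a confidence interval is often reported alongside it. The sketch below uses the common log-OR approximation (Woolf's method) with a 95% normal quantile of 1.96. The function name `odds_ratio_with_ci` is hypothetical and the approach is an assumption added for illustration; the four counts are the allele counts of `rs11252546` from section 1.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio of allele a and its approximate 95% confidence interval.

    a, b: counts of allele A and allele a in cases
    c, d: counts of allele A and allele a in controls
    """
    odds_ratio = (b * c) / (a * d)
    # Standard error of log(OR); all four counts must be non-zero
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


or_value, lower, upper = odds_ratio_with_ci(120, 4, 123, 5)
print("OR = {:.2f}, 95% CI = ({:.2f}, {:.2f})".format(or_value, lower, upper))
```

With only 9 copies of allele a in total, the interval is wide and contains 1, so this SNP gives no clear evidence of either a risk or a protective effect.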

In [14]:
""" 直接使用1.snp过滤中的统计结果来计算
            碱基A   碱基a
Case组       120     4
Control组    123     5
"""
T = lambda a,b,c,d: (b*c)/(a*d)

a = 120
b = 4
c = 123
d = 5

test_T = T(a, b, c, d)
print("T={}".format(test_T))

T=0.82


---

## 2.3 Chi-square Test for Trend

### Principle:

Consider the $2\times3$ contingency table below, where $R_1=N_{11}+N_{12}+N_{13}$, $C_1=N_{11}+N_{21}$, etc.

|         |B=1      |B=2      |B=3      |sum      |
|:-------:|:-------:|:-------:|:-------:|:-------:|
|A=1      |$N_{11}$ |$N_{12}$ |$N_{13}$ |$R_1$    |
|A=2      |$N_{21}$ |$N_{22}$ |$N_{23}$ |$R_2$    |
|sum      |$C_1$    |$C_2$    |$C_3$    |$N$      |

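First compute ① $T=\sum_{i=1}^k t_{i}(N_{1i}R_2 - N_{2i}R_1)$, where $k$ is the number of columns (here $k=3$) and $t_i$ are the weights.

Then compute ② the variance: $Var(T) = \frac{R_1 R_2}{N}(\sum_{i=1}^{k} t_i^2 C_i(N-C_i) - 2\sum_{i=1}^{k-1}\sum_{j=i+1}^{k}t_i t_j C_i C_j)$

Finally compute ③ the test statistic $\frac{T}{\sqrt{Var(T)}}$, which is approximately standard normal under the null hypothesis of no association when the sample is large.

Note: for the weights $t$, GWAS usually uses $(1, 1, 0)$ / $(0, 1, 1)$ / $(0, 1, 2)$ (dominant, recessive, and additive models respectively).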

### Example:

Suppose we have the following contingency table:

|         | Genotype aa | Genotype Aa | Genotype AA | Sum |
|:-------:|:-----------:|:-----------:|:-----------:|:---:|
|Controls | 20          | 20          | 20          |60   |
|Cases    | 10          | 20          | 30          |60   |
|Sum      | 30          | 40          | 50          |120  |

With the weights $t=(1, 1, 0)$ this gives:
- T = 600
- Var(T) = 105000.0
- score = 1.85...
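The arithmetic behind these three numbers can be written out as follows, assuming the weights $t=(1,1,0)$ and taking Controls as row 1 and Cases as row 2 of the table, so that $R_1=R_2=60$, $C=(30, 40, 50)$ and $N=120$:

$$
\begin{aligned}
T &= 1\cdot(20\cdot60 - 10\cdot60) + 1\cdot(20\cdot60 - 20\cdot60) + 0\cdot(20\cdot60 - 30\cdot60) = 600 \\
Var(T) &= \frac{60\cdot60}{120}\left(1^2\cdot30\cdot90 + 1^2\cdot40\cdot80 + 0 - 2\cdot(1\cdot1\cdot30\cdot40)\right) = 30\cdot3500 = 105000 \\
\frac{T}{\sqrt{Var(T)}} &= \frac{600}{\sqrt{105000}} \approx 1.85
\end{aligned}
$$

Only the pair $(i,j)=(1,2)$ contributes to the cross term, because $t_3=0$.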

*See the implementation below for details*

In [15]:
import math
import numpy as np
class CA_test_for_trend:
    def __init__(self, n_1, n_2, k=3, t=(1, 1, 0)):
        assert len(n_1) == k and len(n_2) == k, "len(n_1) and len(n_2) must equal k"
        assert len(t) == k, "len(t) must equal k"
        self.k = k
        self.t = t
        self.table = np.array([n_1, n_2])
        self.R = self.table.sum(axis=1)  # row sums R_1, R_2
        self.C = self.table.sum(axis=0)  # column sums C_1 .. C_k
        self.N = self.table.sum()        # grand total
    
    # Compute T
    def cal_T(self):
        T = 0
        for i in range(self.k):
            T += self.t[i] * ( self.table[0][i] * self.R[1] - self.table[1][i] * self.R[0])
        return T

    # Compute the variance of T
    def cal_varT(self):
        part_1 = self.R[0] * self.R[1] / self.N
        part_2_left = 0
        for i in range(self.k):
            part_2_left += self.t[i]**2 * self.C[i] * (self.N - self.C[i])

        part_2_right = 0
        for i in range(self.k):
            for j in range(i+1, self.k):
                part_2_right += self.t[i] * self.t[j] * self.C[i] * self.C[j]
        part_2_right *= 2

        return part_1 * (part_2_left - part_2_right)
    
    # Standardized test statistic T / sqrt(Var(T))
    def cal_score(self, T, var_T):
        return T / math.sqrt(var_T)

# Example:
Controls = [20, 20, 20]
Cases = [10, 20, 30]
K = 3
t = [1, 1, 0]
myca = CA_test_for_trend(Controls, Cases, K, t)
T = myca.cal_T()
var_T = myca.cal_varT()
score = myca.cal_score(T, var_T)
print("T: {}\nVar(T): {}\nscore: {}".format(T, var_T, score))

T: 600
Var(T): 105000.0
score: 1.8516401995451028
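The score can be converted into a p-value if one assumes it is approximately standard normal under the null hypothesis, which is the usual large-sample approximation for this test. The helper name `two_sided_p_value` is hypothetical; the sketch only uses the standard library.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


# Score of the worked example above: T / sqrt(Var(T)) = 600 / sqrt(105000)
z = 600 / math.sqrt(105000)
print("z = {:.4f}, p = {:.4f}".format(z, two_sided_p_value(z)))
```

For the example table the p-value is slightly above 0.05, so the trend would not be declared significant at the conventional 5% level.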


In [16]:
# Case group:
case_aa = (case_list == 2).sum()  # aa
case_Aa = (case_list == 1).sum()  # Aa
case_AA = (case_list == 0).sum()  # AA
# Control group:
control_aa = (control_list == 2).sum()
control_Aa = (control_list == 1).sum()
control_AA = (control_list == 0).sum()
print('''
           aa  Aa  AA
Case       {}   {}  {}
Control    {}   {}  {}
'''.format(case_aa, case_Aa, case_AA, control_aa, control_Aa, control_AA))

controls = [control_aa, control_Aa, control_AA]
cases = [case_aa, case_Aa, case_AA]
# Run the trend test with the class defined above
myca = CA_test_for_trend(controls, cases, k=3, t=[1, 1, 0])
T = myca.cal_T()
var_T = myca.cal_varT()
score = myca.cal_score(T, var_T)
print("T: {}\nVar(T): {}\nscore: {}".format(T, var_T, score))


           aa  Aa  AA
Case       0   4  58
Control    0   5  59

T: 54
Var(T): 33161.142857142855
score: 0.29653708566750747
