# Preliminary Analysis

This document includes some preliminary results based on the simulated data (from Joey):
- Output of all files from _afflicted_test.zip_
- Summary statistics: allele frequencies
- Basic association analysis on the disease trait for all single SNPs

Simulation Data:
- Obtained from https://www.synapse.org/#!Synapse:syn22314894
- **Base population**: 4000 pedigrees, each with 2-6 children, generated with "msprime" (coalescent simulation method)
    - The range of 2-6 children reflects the actual distribution of the number of children in a typical US family
- **Whole genome**: 12500 variants on one arbitrary chromosome
- Offspring are generated by recombination, using a segmented loop that randomly selects one of a parent's haplotypes chunk by chunk
    - The loop switches segments every 125 bp
    - Every 125 bp, one of the parent's haplotypes is chosen at random and the child receives that segment, so that siblings are not identical (confirmed with the GRM).
- Each selected family must have >=2 individuals with the binary phenotype (V11)
    - After selection (>=2 cases), there are **5680 individuals** and **710 families**
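
The segment-switching step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the simulation code itself: the helper name `recombine`, the string encoding of haplotypes (one allele letter per position), and the 50/50 choice between a parent's two haplotypes are all hypothetical.

```shell
# Hypothetical helper: build one child haplotype from a parent's two haplotypes,
# choosing the source haplotype at random for every chunk of `chunk` positions.
recombine() {
  # $1, $2: the parent's two haplotypes as equal-length allele strings
  # $3: chunk length (125 in the description above); $4: random seed
  awk -v a="$1" -v b="$2" -v chunk="$3" -v seed="$4" 'BEGIN {
    srand(seed)
    out = ""
    for (start = 1; start <= length(a); start += chunk) {
      src = (rand() < 0.5) ? a : b          # pick one haplotype per chunk
      out = out substr(src, start, chunk)   # copy that chunk unchanged
    }
    print out
  }'
}

# Toy parent: one haplotype of all A alleles, one of all C alleles (500 positions)
hap_a=$(head -c 500 /dev/zero | tr '\0' 'A')
hap_c=$(head -c 500 /dev/zero | tr '\0' 'C')
recombine "$hap_a" "$hap_c" 125 1 | cut -c 1-20
```

With 12500 positions and a chunk length of 125, a sketch like this would make 100 independent choices per transmitted haplotype.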


In [1]:
# Current working directory
pwd

/Users/duz/Desktop/Center_for_Statistical_Genetics/family-association_DZ/analysis


In [2]:
# Change working directory to simulation data folder
cd ./afflicted_test

In [3]:
# 1) my_snps.txt: Selected SNPs that were given effects (MAF around 0.1)
head my_snps.txt

rs10085
rs10119
rs10120
rs10207
rs10233
rs10235
rs10269
rs10366
rs10368
rs10605


In [4]:
# 2) snp_eff.txt: Each selected SNP, pulled from the generated data, with its effect size
# Protective variant effect size = -0.693
# Susceptible variant effect size = 0.405
head snp_eff.txt

rs10085	0.405465108108164
rs10119	0.405465108108164
rs10120	0.405465108108164
rs10207	0.405465108108164
rs10233	0.405465108108164
rs10235	0.405465108108164
rs10269	0.405465108108164
rs10366	0.405465108108164
rs10368	0.405465108108164
rs10605	0.405465108108164
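
The value 0.405465108108164 equals ln(1.5), and -0.693 is approximately ln(0.5), which suggests the effect sizes are log odds ratios. Assuming that interpretation, a small helper (the name `log_or_to_or` is hypothetical) converts them back to odds ratios:

```shell
# Hypothetical helper: convert a log odds ratio to an odds ratio.
# Assumes the second column of snp_eff.txt is on the log-odds scale.
log_or_to_or() {
  awk -v b="$1" 'BEGIN { printf "%.3f\n", exp(b) }'
}

log_or_to_or 0.405465108108164   # susceptible variant
log_or_to_or -0.693              # protective variant
```

Under this reading, a susceptible allele multiplies the odds of disease by about 1.5 and a protective allele halves them.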


In [7]:
# 3) GRM file: six_fam_eff_affected_grm.sXX.txt 

In [8]:
# 4) SNPs not in LD with each selected SNP
# Quality control check
head snps_not_in_ld_with_selected.txt

CHR_A	BP_A	SNP_A	CHR_B	BP_B	SNP_B	R2
1	10085	rs10085	1	1	rs1	3.47977e-05
1	10085	rs10085	1	2	rs2	5.82632e-05
1	10085	rs10085	1	3	rs3	5.49087e-05
1	10085	rs10085	1	4	rs4	0.000318135
1	10085	rs10085	1	5	rs5	0.000367395
1	10085	rs10085	1	6	rs6	3.86758e-05
1	10085	rs10085	1	7	rs7	8.34403e-05
1	10085	rs10085	1	8	rs8	8.59001e-06
1	10085	rs10085	1	9	rs9	8.59001e-06
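
One way to use this table as a quality control check is to list any pair whose R2 exceeds a chosen cut-off; an empty result would support the claim that the listed SNPs are not in LD with the selected ones. The sketch below assumes the seven-column layout shown above with a header row. The helper name `ld_pairs_above` and the cut-off are illustrative.

```shell
# Hypothetical helper: print SNP_A, SNP_B and R2 for pairs above a threshold.
# Reads the LD table on stdin and skips the header line.
ld_pairs_above() {
  awk -v thr="$1" 'NR > 1 && $7 + 0 > thr + 0 { print $3, $6, $7 }'
}

# Rows copied from the head of the file above
printf '%s\n' \
  "CHR_A BP_A SNP_A CHR_B BP_B SNP_B R2" \
  "1 10085 rs10085 1 1 rs1 3.47977e-05" \
  "1 10085 rs10085 1 4 rs4 0.000318135" \
  "1 10085 rs10085 1 5 rs5 0.000367395" | ld_pairs_above 0.0003
```

On the real file the same function would be called as `ld_pairs_above 0.2 < snps_not_in_ld_with_selected.txt`, where 0.2 is an arbitrary example threshold.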


In [9]:
# 5) six_fam_eff_sample_affected_fams.txt
# FID and IID of the individuals kept after subsetting to families with >=2 cases
# and sampling families to match the family-size distribution from census information:
head six_fam_eff_sample_affected_fams.txt

2	id_3_1
2	id_4_1
3	id_5_1
3	id_6_1
9	id_17_1
9	id_18_1
10	id_19_1
10	id_20_1
2	id_22_1
3	id_23_1


In [10]:
# 6) six_fam_eff_phenotypes_structured_selected_fams.txt
# pop_id: population id
# V1: Continuous trait
# V11: Binary trait (0,1)
head six_fam_eff_phenotypes_structured_selected_fams.txt

IID	FID	pop_id	V1	V11
id_3_1	2	1	3.52914085178346	1
id_4_1	2	1	1.78691807704925	1
id_5_1	3	1	-1.26760779020004	1
id_6_1	3	1	2.27087740878414	1
id_17_1	9	1	5.17571703602393	0
id_18_1	9	1	-2.38575755858843	0
id_19_1	10	1	2.43594972380419	1
id_20_1	10	1	5.5551344722583	1
id_22_1	2	1	5.15389011434599	0
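
The selection rule (each family has at least 2 individuals with V11 = 1) can be checked against this table with a short sketch. It assumes the column order shown above (IID, FID, pop_id, V1, V11) and a header row; the helper name `families_with_cases` is hypothetical.

```shell
# Hypothetical helper: list FIDs with at least `min_cases` affected members.
# Assumes columns IID FID pop_id V1 V11 and a header row on stdin.
families_with_cases() {
  awk -v min_cases="$1" '
    NR > 1 && $5 == 1 { cases[$2]++ }      # count V11 == 1 per family
    END { for (fid in cases) if (cases[fid] >= min_cases) print fid }
  ' | sort -n
}

# Rows copied from the head of the phenotype file above
printf '%s\n' \
  "IID FID pop_id V1 V11" \
  "id_3_1 2 1 3.52914085178346 1" \
  "id_4_1 2 1 1.78691807704925 1" \
  "id_5_1 3 1 -1.26760779020004 1" \
  "id_6_1 3 1 2.27087740878414 1" \
  "id_17_1 9 1 5.17571703602393 0" \
  "id_18_1 9 1 -2.38575755858843 0" \
  "id_19_1 10 1 2.43594972380419 1" \
  "id_20_1 10 1 5.5551344722583 1" \
  "id_22_1 2 1 5.15389011434599 0" | families_with_cases 2
```

Run on the full file and piped to `wc -l`, the count should equal 710 if every retained family satisfies the rule. In the excerpt, family 9 has no cases among its first two members, so its cases would have to appear in later rows.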


In [11]:
# 7) six_fam_eff_phenotypes_structured_one_selected_fams.txt
# Same as 6), except that only the V1 and V11 columns are included (no header, no IDs)
head six_fam_eff_phenotypes_structured_one_selected_fams.txt

3.52914085178346	1
1.78691807704925	1
-1.26760779020004	1
2.27087740878414	1
5.17571703602393	0
-2.38575755858843	0
2.43594972380419	1
5.5551344722583	1
5.15389011434599	0
0.515583680261195	0


In [12]:
# 8) six_fam_eff.assoc.txt
# Basic association test results from GEMMA, using the GRM
head six_fam_eff.assoc.txt

chr	rs	ps	n_miss	allele1	allele0	af	beta	se	logl_H1	l_remle	l_mle	p_wald	p_lrt	p_score
1	rs2	2	0	C	A	0.025	-8.035558e-02	1.913898e-01	-1.109916e+04	5.403260e+00	5.403422e+00	6.746088e-01	6.745518e-01	6.745773e-01
1	rs4	4	0	C	A	0.128	6.971148e-02	1.536459e-01	-1.109915e+04	5.404674e+00	5.404446e+00	6.500511e-01	6.499817e-01	6.500029e-01
1	rs7	7	0	C	A	0.265	-2.162183e-02	9.290714e-02	-1.109922e+04	5.403634e+00	5.404270e+00	8.159832e-01	8.159403e-01	8.159494e-01
1	rs8	8	0	C	A	0.145	1.230853e-01	1.271866e-01	-1.109878e+04	5.401501e+00	5.401260e+00	3.332088e-01	3.331300e-01	3.332241e-01
1	rs9	9	0	C	A	0.145	1.230853e-01	1.271866e-01	-1.109878e+04	5.401501e+00	5.401260e+00	3.332088e-01	3.331300e-01	3.332241e-01
1	rs10	10	0	C	A	0.226	-9.550721e-03	8.990818e-02	-1.109924e+04	5.403859e+00	5.404329e+00	9.154056e-01	9.153798e-01	9.153852e-01
1	rs14	14	0	C	A	0.128	6.971148e-02	1.536459e-01	-1.109915e+04	5.404674e+00	5.404446e+00	6.500511e-01	6.499817e-01	6.500029e-01
1	rs15	15	0	C	A	0.405	4.720208e

In [13]:
# 9) Plink file: six_fam_eff_affected.bed

In [14]:
# 10) Plink file: six_fam_eff_affected.bim
head six_fam_eff_affected.bim

1	rs1	0	1	C	A
1	rs2	0	2	C	A
1	rs3	0	3	C	A
1	rs4	0	4	C	A
1	rs5	0	5	C	A
1	rs6	0	6	C	A
1	rs7	0	7	C	A
1	rs8	0	8	C	A
1	rs9	0	9	C	A
1	rs10	0	10	C	A


In [15]:
# 11) Plink file: six_fam_eff_affected.fam
# Columns: family individual_ID father mother sex affected_status(1=control, 2=case)
# Only FID and IID are informative; the remaining columns are dummy values required by the PLINK format
head six_fam_eff_affected.fam 

2 id_3_1 0 0 1 1
2 id_4_1 0 0 1 1
3 id_5_1 0 0 1 1
3 id_6_1 0 0 1 1
9 id_17_1 0 0 1 1
9 id_18_1 0 0 1 1
10 id_19_1 0 0 1 1
10 id_20_1 0 0 1 1
2 id_22_1 0 0 1 1
3 id_23_1 0 0 1 1


### Summary statistics: allele frequencies

In [16]:
plink --bfile six_fam_eff_affected --pheno six_fam_eff_phenotypes_structured_selected_fams.txt --1 --freq --out freq_stat

PLINK v1.90p 64-bit (16 Jun 2020)              www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to freq_stat.log.
Options in effect:
  --1
  --bfile six_fam_eff_affected
  --freq
  --out freq_stat
  --pheno six_fam_eff_phenotypes_structured_selected_fams.txt

8192 MB RAM detected; reserving 4096 MB for main workspace.
12500 variants loaded from .bim file.
5680 people (5680 males, 0 females) loaded from .fam.
0 phenotype values present after --pheno.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5680 founders and 0 nonfounders present.
Calculating allele frequencies... done.
--freq: Allele frequencies (founders only) written to freq_stat.frq .


In [17]:
head freq_stat.frq 

 CHR     SNP   A1   A2          MAF  NCHROBS
   1     rs1    C    A     0.002113    11360
   1     rs2    C    A      0.02535    11360
   1     rs3    C    A            0    11360
   1     rs4    C    A       0.1282    11360
   1     rs5    C    A    0.0007042    11360
   1     rs6    C    A     0.006338    11360
   1     rs7    C    A       0.2651    11360
   1     rs8    C    A       0.1447    11360
   1     rs9    C    A       0.1447    11360
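
NCHROBS is 11360 for every SNP, which is 2 x 5680 chromosomes, so no genotypes are missing in these rows. The GEMMA output in 8) lists rs2, rs4, rs7, rs8, rs9 and rs10 but not rs1, rs3, rs5 or rs6; those four SNPs all have MAF below 0.01 here, which would be consistent with a minor allele frequency filter at 0.01 (believed to be GEMMA's default for `-maf`, though that is an assumption). The sketch below lists SNPs under a given MAF cut-off; the helper name `snps_below_maf` is hypothetical.

```shell
# Hypothetical helper: print SNP IDs whose MAF is below a threshold.
# Reads a PLINK .frq table (CHR SNP A1 A2 MAF NCHROBS) with header on stdin.
snps_below_maf() {
  awk -v thr="$1" 'NR > 1 && $5 + 0 < thr + 0 { print $2 }'
}

# Rows copied from the head of freq_stat.frq above
printf '%s\n' \
  " CHR SNP A1 A2 MAF NCHROBS" \
  " 1 rs1 C A 0.002113 11360" \
  " 1 rs2 C A 0.02535 11360" \
  " 1 rs3 C A 0 11360" \
  " 1 rs4 C A 0.1282 11360" \
  " 1 rs5 C A 0.0007042 11360" \
  " 1 rs6 C A 0.006338 11360" | snps_below_maf 0.01
```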


### Basic association analysis on the disease trait for all single SNPs

In [18]:
plink --bfile six_fam_eff_affected --assoc --out as1 

PLINK v1.90p 64-bit (16 Jun 2020)              www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to as1.log.
Options in effect:
  --assoc
  --bfile six_fam_eff_affected
  --out as1

8192 MB RAM detected; reserving 4096 MB for main workspace.
12500 variants loaded from .bim file.
5680 people (5680 males, 0 females) loaded from .fam.
5680 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5680 founders and 0 nonfounders present.
Calculating allele frequencies... done.
12500 variants and 5680 people pass filters and QC.
Among remaining phenotypes, 0 are cases and 5680 are controls.
Writing C/C --assoc report to as1.assoc ... done.
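
The log above reports 0 cases and 5680 controls because `--assoc` without `--pheno` reads the phenotype from column 6 of the .fam file, which is a dummy value of 1 for every individual. The earlier `--freq` run did pass `--pheno` but loaded 0 phenotype values, which would be consistent with the phenotype file listing IID before FID: PLINK 1.9 expects FID and IID in the first two columns and, by default, the phenotype in the third. A possible fix, sketched under those assumptions, is to write a reordered file containing FID, IID and V11. The helper name `make_plink_pheno` and the output names `v11_pheno.txt` and `as1_v11` are hypothetical.

```shell
# Hypothetical helper: reorder the phenotype table for PLINK's --pheno.
# stdin:  IID FID pop_id V1 V11 (with header)
# stdout: FID IID V11 (no header)
make_plink_pheno() {
  awk 'NR > 1 { print $2, $1, $5 }'
}

# Rows copied from the head of the phenotype file
printf '%s\n' \
  "IID FID pop_id V1 V11" \
  "id_3_1 2 1 3.52914085178346 1" \
  "id_17_1 9 1 5.17571703602393 0" | make_plink_pheno

# Intended use on the real data (not run here):
# make_plink_pheno < six_fam_eff_phenotypes_structured_selected_fams.txt > v11_pheno.txt
# plink --bfile six_fam_eff_affected --pheno v11_pheno.txt --1 --assoc --out as1_v11
```

The `--1` flag is kept because V11 is coded 0/1 rather than PLINK's default 1/2. If the fix works, the log should report a non-zero number of cases.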


### Adjusted for multiple testing

In [19]:
plink --bfile six_fam_eff_affected --assoc --adjust --out as2 

PLINK v1.90p 64-bit (16 Jun 2020)              www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to as2.log.
Options in effect:
  --adjust
  --assoc
  --bfile six_fam_eff_affected
  --out as2

8192 MB RAM detected; reserving 4096 MB for main workspace.
12500 variants loaded from .bim file.
5680 people (5680 males, 0 females) loaded from .fam.
5680 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 5680 founders and 0 nonfounders present.
Calculating allele frequencies... done.
12500 variants and 5680 people pass filters and QC.
Among remaining phenotypes, 0 are cases and 5680 are controls.
Writing C/C --assoc report to as2.assoc ... done.
--adjust: Gen

In [20]:
head as2.assoc

 CHR     SNP         BP   A1      F_A      F_U   A2        CHISQ            P           OR 
   1     rs1          1    C       NA 0.002113    A            0            1           NA 
   1     rs2          2    C       NA  0.02535    A            0            1           NA 
   1     rs3          3    C       NA        0    A           NA           NA           NA 
   1     rs4          4    C       NA   0.1282    A            0            1           NA 
   1     rs5          5    C       NA 0.0007042    A            0            1           NA 
   1     rs6          6    C       NA 0.006338    A            0            1           NA 
   1     rs7          7    C       NA   0.2651    A            0            1           NA 
   1     rs8          8    C       NA   0.1447    A            0            1           NA 
   1     rs9          9    C       NA   0.1447    A            0            1           NA 


In [21]:
head as2.assoc.adjusted

 CHR     SNP      UNADJ         GC       BONF       HOLM   SIDAK_SS   SIDAK_SD     FDR_BH     FDR_BY
   1 rs12500          1          1          1          1          1          1          1          1 
   1     rs2          1          1          1          1          1          1          1          1 
   1     rs4          1          1          1          1          1          1          1          1 
   1     rs5          1          1          1          1          1          1          1          1 
   1     rs6          1          1          1          1          1          1          1          1 
   1     rs7          1          1          1          1          1          1          1          1 
   1     rs8          1          1          1          1          1          1          1          1 
   1     rs9          1          1          1          1          1          1          1          1 
   1    rs10          1          1          1          1          1          1     
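
For reference, the BONF column can be reproduced by hand: the Bonferroni correction multiplies each unadjusted p-value by the number of valid tests and caps the result at 1. The sketch below assumes a two-column input (SNP, unadjusted P) and skips rows whose P is NA, as for rs3 in as2.assoc. The helper name `bonferroni` and the example p-values are illustrative and not taken from the data.

```shell
# Hypothetical helper: Bonferroni-adjust p-values.
# stdin: SNP and unadjusted p-value per line; rows with P == "NA" are skipped.
bonferroni() {
  awk '$2 != "NA" { snp[++m] = $1; p[m] = $2 + 0 }
       END {
         for (i = 1; i <= m; i++) {
           adj = p[i] * m            # multiply by the number of valid tests
           if (adj > 1) adj = 1      # cap at 1
           print snp[i], adj
         }
       }'
}

# Made-up p-values, for illustration only
printf '%s\n' "snpA 0.01" "snpB NA" "snpC 0.5" "snpD 0.0001" | bonferroni
```

In the output above every unadjusted p-value is 1, so every adjusted value is also 1.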