# **Power of family-based association methods with fixed sample size**

This notebook investigates how sample relatedness affects the power to detect associations with family-based methods.

## **Aim**

This simulation study compares **family-based vs population-based** methods of association for complex traits in terms of:
1. **Improvement in power:** family-based studies benefit from the expectation that causal variants have larger effect sizes. However, unaffected pedigree members may be enriched for causal variants, which makes them less-than-perfect controls compared with population controls, where no causal variant enrichment is expected.
    * This raises several questions: Do family-based methods benefit from large effect sizes in complex traits? Does the gain in power offset the loss due to sample relatedness? Are causal variant effect sizes in fact larger in families with complex traits, and if so, by how much?


## **Method**

* Simulate pedigrees and genotype data using SeqSIMLA. Also simulate cases and controls.
* Compare pedigree unaffected individuals versus population unaffected individuals using a GLMM (SAIGE software).
* Compare family-based vs population-based designs with a GLMM, using different strategies:
1. Analyze families using a GLMM, including both affected and unaffected members.
2. Analyze the affected individuals in the families and replace the unaffected members with members of families that were simulated but have no affected individuals.
3. Analyze the affected family members and replace the unaffected members with population-based controls (take one unaffected individual from each family that was simulated with no affected individuals).
4. Analyze a sample of unrelated individuals with the same number of affected and unaffected individuals as in analyses 1-3.

## **Hypothesis**

For lower ORs, the probability that unaffected family members carry susceptibility variants should be higher than for higher ORs.
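
The hypothesis can be explored numerically before running any simulation. The following is a minimal sketch, not part of the original workflow. It assumes a single biallelic locus in Hardy-Weinberg equilibrium, a log-additive odds model (each risk allele multiplies the odds of disease by the OR), and sib phenotypes that are independent given the sibs' genotypes. The function names `penetrances`, `child_dist` and `carrier_prob_unaffected_sib` are hypothetical. The sketch computes the probability that an unaffected sib of an affected proband carries at least one risk allele, so that the value can be compared across ORs.

```python
from itertools import product


def penetrances(maf, odds_ratio, prevalence):
    """Genotype frequencies and penetrances for 0, 1, 2 risk alleles.

    Assumes Hardy-Weinberg equilibrium and a log-additive odds model.
    """
    geno_freq = [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]

    def pens(base_odds):
        odds = [base_odds * odds_ratio ** g for g in range(3)]
        return [o / (1 + o) for o in odds]

    # Bisection on the baseline odds so that the population-average
    # penetrance matches the requested prevalence.
    lo, hi = 0.0, 1e6
    for _ in range(200):
        mid = (lo + hi) / 2
        avg = sum(p * f for p, f in zip(geno_freq, pens(mid)))
        if avg < prevalence:
            lo = mid
        else:
            hi = mid
    return geno_freq, pens((lo + hi) / 2)


def child_dist(g_father, g_mother):
    """Mendelian distribution of a child's risk-allele count given the parents."""
    tf, tm = g_father / 2, g_mother / 2  # transmission probabilities
    return [(1 - tf) * (1 - tm), tf * (1 - tm) + (1 - tf) * tm, tf * tm]


def carrier_prob_unaffected_sib(maf, odds_ratio, prevalence):
    """P(sib carries >= 1 risk allele | proband affected, sib unaffected)."""
    geno_freq, pen = penetrances(maf, odds_ratio, prevalence)
    num = den = 0.0
    for gf, gm in product(range(3), repeat=2):
        mating = geno_freq[gf] * geno_freq[gm]  # random mating (assumption)
        dist = child_dist(gf, gm)
        p_proband_aff = sum(dist[g] * pen[g] for g in range(3))
        p_sib_unaff = sum(dist[g] * (1 - pen[g]) for g in range(3))
        p_sib_unaff_carrier = sum(dist[g] * (1 - pen[g]) for g in (1, 2))
        den += mating * p_proband_aff * p_sib_unaff
        num += mating * p_proband_aff * p_sib_unaff_carrier
    return num / den


if __name__ == "__main__":
    # MAF and prevalence as in the Data section; the OR grid is illustrative.
    for or_value in (1.2, 1.5, 2.0, 3.0):
        print(or_value, round(carrier_prob_unaffected_sib(0.1, or_value, 0.1), 4))
```

With OR=1 the result reduces to the population carrier frequency, $1-(1-\text{MAF})^2$, which is a useful sanity check before comparing other ORs.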

## **Data**

1. **Pedigree and genotype simulation parameters:** MAF=0.1, OR=2.0, prevalence=0.1
2. **Pedigrees:** two-generation families, each with two or more affected individuals. Families were simulated following the distribution of the number of children in USA families as of 2019. For more information see: https://www.census.gov/data/tables/2019/demo/families/cps-2019.html
3. **Sample size:** start with 5000 pedigrees
4. **Genetic model:** additive
5. **GLMM:** assess power at an alpha of $5 \times 10^{-8}$
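
One way to assess power at this alpha is to count, across simulated replicates, how often the association test at the causal variant is significant. The sketch below assumes that one p-value per replicate has already been extracted from the GLMM output (the extraction step is not shown, and the function name `empirical_power` is hypothetical). It returns the empirical power together with a Wilson 95% interval, which behaves better than the plain normal interval when power is close to 0 or 1.

```python
import math


def empirical_power(p_values, alpha=5e-8):
    """Fraction of replicates with p < alpha, plus a Wilson 95% interval."""
    n = len(p_values)
    if n == 0:
        raise ValueError("at least one replicate p-value is required")
    hits = sum(1 for p in p_values if p < alpha)
    phat = hits / n

    z = 1.959963984540054  # 97.5th percentile of the standard normal
    denom = 1 + z ** 2 / n
    centre = (phat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z ** 2 / (4 * n ** 2)) / denom
    return phat, (max(0.0, centre - half), min(1.0, centre + half))


if __name__ == "__main__":
    # Hypothetical p-values for illustration only, not simulation results.
    toy = [1e-9, 3e-10, 0.5, 0.01]
    power, (low, high) = empirical_power(toy)
    print(power, low, high)
```

With only 100 replicates the interval is wide, so differences in power between designs should be read against it.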

# **SeqSIMLA:** simulation of multigenerational pedigrees with phenotype and genotype data

Run SeqSIMLA for a variant with MAF=0.1 and OR=2.0, and a disease prevalence of 10%. The inputs are the reference sequence file `EUR_500K.bed.gz`, the recombination file `EUR_500K.rec`, the simulated pedigree file `simped1000.txt` and the proband file `proband.txt`, in which all of the offspring are affected for the trait. The script `monitor.py` was used to track memory usage.

In [None]:
[global]
# Name for the output files
parameter: header = 'Sim1'
# Reference sequence file
parameter: popfile = 'EUR_chr1.bed.gz'
# Recombination file
parameter: recfile = 'EUR_chr1.rec'
# Disease prevalence
parameter: prev = 0.1
# Odds ratio
parameter: OR = 2.0
# model --mode-prev for simulating disease status based on prevalence and OR
parameter: model = '--mode-prev'
# Path to the ped file (6-column PED in linkage format)
parameter: famfile = path('simped1000.ped')
# Output directory
parameter: folder = path('results')
# Select location of disease sites
parameter: site = 7319
# The number of simulated replicates to generate
parameter: batch = 100

In [None]:
[simulate (SeqSIMLA)]
bash: container = 'dianacornejo/seqsimla' , expand = '${ }'
    mkdir -p ${folder}
    SeqSIMLA -popfile ${popfile} -recfile ${recfile} -famfile ${famfile} -folder ${folder} -header ${header} -batch ${batch} -site ${site} ${model} -prev ${prev} -or ${OR}

In [None]:
python /usr/local/bin/monitor.py SeqSIMLA -popfile EUR_500K.bed.gz -recfile EUR_500K.rec -famfile simped1000.txt -proband proband.txt -folder results -header Sim1 -batch 1 -site 7319 --mode-prev -prev 0.1 -or 2.0

The output files are `Sim21.ped`, `Sim2.map`, `Sim2.freq` and `Sim2_result.txt`. In `Sim21.ped`, the first 6 columns correspond to the pedigree file in linkage format; from column 7 onward, each SNP is encoded as 2 alleles.
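
The column layout described above can be read with a few lines of Python. This is a minimal sketch, assuming whitespace-separated fields, the standard linkage column order (family ID, individual ID, father ID, mother ID, sex, phenotype) and the usual linkage phenotype coding (1 = unaffected, 2 = affected). The allele coding used by SeqSIMLA is not assumed, so alleles are kept as strings. The function name `parse_ped_line` and the example line are hypothetical.

```python
def parse_ped_line(line):
    """Split one PED line into pedigree fields and per-SNP allele pairs."""
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 pedigree columns followed by allele pairs")

    fam_id, ind_id, father, mother, sex, phenotype = fields[:6]
    alleles = fields[6:]
    # Columns 7-8 hold SNP 1, columns 9-10 hold SNP 2, and so on.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]

    return {
        "fam_id": fam_id,
        "ind_id": ind_id,
        "father": father,
        "mother": mother,
        "sex": sex,
        "phenotype": phenotype,
        "affected": phenotype == "2",  # assumes linkage coding: 2 = affected
        "genotypes": genotypes,
    }


if __name__ == "__main__":
    # Hypothetical line: an affected offspring genotyped at two SNPs.
    record = parse_ped_line("1 3 1 2 1 2 1 2 2 2")
    print(record["affected"], record["genotypes"])
```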

Run this command line to test the program with the example data for the Asian population.

In [None]:
python /usr/local/bin/monitor.py SeqSIMLA -popfile ASN_500k.bed.gz -recfile ASN_500k.rec -famfile SAP.txt -proband probands.txt -folder test1 -header test -batch 1 -site 1,200,3000 --mode-prev -prev 0.05 -or 1.2 

Here, `ref_file.bed.gz` was simulated to have 4 variant sites (the minimum accepted by SeqSIMLA) and 1000 individuals. The selected disease site was #1.

In [None]:
convert ref_file.txt ref_file.bed && gzip ref_file.bed
python /usr/local/bin/monitor.py SeqSIMLA -popfile ref_file.bed.gz -famfile simped100.txt -proband proband100.txt -folder results -header Sim2 -batch 1 -site 1 --mode-prev -prev 0.1 -or 2.0

Run the simulation with 200 families. This makes it possible to choose 100 unrelated cases and 100 unrelated controls (in this case, the unaffected parents from the second set of 100 families).

In [None]:
python /usr/local/bin/monitor.py SeqSIMLA -popfile ref_file.bed.gz -famfile simped200.txt -proband proband200.txt -folder results_200 -header Sim2 -batch 1 -site 1 --mode-prev -prev 0.1 -or 2.0
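
The selection of unrelated cases and controls from the 200 simulated families could be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: records are `(fam_id, ind_id, father, mother, phenotype)` tuples taken from the first PED columns, founders have both parent IDs equal to `"0"`, phenotypes follow the linkage coding (1 = unaffected, 2 = affected), families appear in file order, and individuals from different simulated families are unrelated. The function name `pick_unrelated` is hypothetical.

```python
def pick_unrelated(records, n_case_families):
    """Pick one affected offspring per case family and one unaffected
    parent per remaining family.

    records: iterable of (fam_id, ind_id, father, mother, phenotype) tuples.
    """
    families = {}
    for rec in records:
        families.setdefault(rec[0], []).append(rec)  # dict keeps file order
    fam_ids = list(families)

    cases, controls = [], []
    for fam in fam_ids[:n_case_families]:
        affected_offspring = [
            r for r in families[fam]
            if r[4] == "2" and r[2] != "0" and r[3] != "0"
        ]
        if affected_offspring:
            cases.append(affected_offspring[0])  # first one; could be random

    for fam in fam_ids[n_case_families:]:
        unaffected_parents = [
            r for r in families[fam]
            if r[4] == "1" and r[2] == "0" and r[3] == "0"
        ]
        if unaffected_parents:
            controls.append(unaffected_parents[0])

    return cases, controls


if __name__ == "__main__":
    # Toy pedigree records for illustration only.
    toy = [
        ("F1", "1", "0", "0", "1"), ("F1", "2", "0", "0", "1"),
        ("F1", "3", "1", "2", "2"),
        ("F2", "1", "0", "0", "1"), ("F2", "2", "0", "0", "1"),
        ("F2", "3", "1", "2", "1"),
    ]
    print(pick_unrelated(toy, n_case_families=1))
```

Taking a single individual per family keeps the sample unrelated. Families that contribute no eligible individual are skipped, so the resulting counts should be checked against the intended 100 cases and 100 controls.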