# Simulation of genotypic and phenotypic data using SeqSIMLA

## Input files

1. **Reference sequence file:** this file contains a population of sequences (haplotypes). It is a gzip-compressed binary file (`*.gz`).
2. **Recombination fraction file:** this file specifies the recombination fractions for locations in the reference sequences. **Optional**
3. **Pedigree structure file:** this file specifies the family structures to be simulated. The format is the same as a PLINK `.fam` file.
4. **Proband file:** this file has the same format as the pedigree structure file. It specifies the affection status of an individual, and SeqSIMLA will simulate the same affection status given here.
5. **Penetrance file:** in this file you can specify the penetrance for each genotype. **Optional**
6. **Site file:** this file is used to specify the disease sites, although the `-site` option can be used instead. Multiple pairwise interactions can also be specified in this file.
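As an illustration of the PLINK `.fam` layout used by the pedigree structure and proband files, here is a minimal Python sketch that builds rows for nuclear families. The six columns are family ID, individual ID, father ID, mother ID, sex (1 = male, 2 = female) and phenotype. The helper names `nuclear_family_rows` and `format_fam` are hypothetical, the ID scheme (`FAM1`, `M1`, `F1`, `O1`, ...) mirrors the IDs seen in the simulated output of this notebook, and the phenotype value `0` and the children's sexes are placeholder assumptions, not values required by SeqSIMLA.

```python
def nuclear_family_rows(fam_index, n_children=2):
    """Return .fam-style rows for one family: mother, father, then children.

    Hypothetical helper for illustration only.
    """
    fid = f"FAM{fam_index}"
    father, mother = f"F{fam_index}", f"M{fam_index}"
    rows = [
        # Founders have father ID and mother ID set to 0.
        (fid, mother, "0", "0", "2", "0"),
        (fid, father, "0", "0", "1", "0"),
    ]
    for k in range(1, n_children + 1):
        sex = "1" if k % 2 == 0 else "2"  # arbitrary choice (assumption)
        rows.append((fid, f"O{k}", father, mother, sex, "0"))
    return rows


def format_fam(rows):
    """Join rows into whitespace-delimited text, one individual per line."""
    return "\n".join(" ".join(row) for row in rows) + "\n"


# Two families with two children each.
fam_text = format_fam(
    [row for i in (1, 2) for row in nuclear_family_rows(i)]
)
print(fam_text, end="")
```

The resulting text could be written to a file and passed to `-famfile`; what SeqSIMLA expects in the phenotype column of that file should be checked against its manual.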

## Output files

1. **map file:** formatted as a PLINK map file
2. **ped file:** in LINKAGE format. The first six columns are the same as in the pedigree structure file; after that, every two columns correspond to one marker.
3. **phe file:**
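The ped layout described above (six pedigree columns, then two allele columns per marker) can be sketched in Python as follows. The function name `parse_ped_line` is hypothetical, and the example line is taken from the simulated output shown later in this notebook.

```python
def parse_ped_line(line):
    """Split one ped line into pedigree fields and per-marker allele pairs.

    Hypothetical helper for illustration only.
    """
    fields = line.split()
    if len(fields) < 6 or (len(fields) - 6) % 2 != 0:
        raise ValueError("expected 6 pedigree columns plus 2 columns per marker")
    keys = ("family", "individual", "father", "mother", "sex", "phenotype")
    meta = dict(zip(keys, fields[:6]))
    alleles = fields[6:]
    # Every two consecutive allele columns form the genotype of one marker.
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return meta, genotypes


example = "FAM1 F1 0 0 1 1 1 2 1 1 1 1 1 2"
meta, genotypes = parse_ped_line(example)
print(meta["individual"], genotypes)
```

With eight allele columns, this line describes four markers, which is consistent with the four-site reference used in this notebook.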

In [2]:
pwd

The following R code was written to generate `ref_file.txt`, on the assumption that the output of the Python code in `generate_refseq.ipynb` contained spaces. At present that code can be used as it is, so the R cells below are commented out.

In [5]:
#dat = read.table('seqsimla/input/ref_file.txt')
#mat = cbind(rbinom(1000,1,0.1),rbinom(1000,1,0.1),rbinom(1000,1,0.1),rbinom(1000,1,0.1))

In [6]:
#apply(mat,2,mean)

In [19]:
#write.table(mat, 'mat.txt',col.names=F,row.names=F,sep='')
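For readers who prefer Python, the commented R cells above can be approximated with the standard library alone. This is a sketch under assumptions: the function names `simulate_haplotypes` and `to_text` are hypothetical, the fixed seed is added only for reproducibility, and the layout (one row per line, alleles concatenated without a separator) mirrors `write.table(..., sep='')`.

```python
import random


def simulate_haplotypes(n_rows=1000, n_cols=4, p=0.1, seed=0):
    """Draw an n_rows x n_cols matrix of 0/1 alleles, each 1 with probability p."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < p else 0 for _ in range(n_cols)]
        for _ in range(n_rows)
    ]


def to_text(matrix):
    """Concatenate each row's alleles with no separator, one row per line."""
    return "\n".join("".join(str(a) for a in row) for row in matrix) + "\n"


mat = simulate_haplotypes()
# Column means, analogous to apply(mat, 2, mean) in R.
col_means = [sum(col) / len(col) for col in zip(*mat)]
text = to_text(mat)
print(col_means)
print(text.splitlines()[:3])
```

The string `text` could then be written to `mat.txt` before the conversion step.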

In this step, `ref_file.txt` needs to be converted to a binary file that SeqSIMLA can read.

### Global parameter setting

In [None]:
[global]
# Prefix of the reference sequence file (the `.txt` extension is appended in the convert step)
parameter: ref_file = 'ref_file'

In [21]:
./convert ${ref_file}.txt ${ref_file}.bed && rm -f ${ref_file}.bed.gz && gzip ${ref_file}.bed


Reading the reference file...
1000 haplotypes with 4 bp have been read

Writing the binary reference file...



In [8]:
SeqSIMLA -popfile mat.bed.gz -famfile simped100.txt -proband proband100.txt -folder results -header Sim2 -batch 1 -site 1  --mode-prev -prev 0.1 -or 2.0

Simulation started: Thu Mar  5 14:52:47 2020

SeqSIMLA version:		 2.9.1 31August2017
Reference file:			 mat.bed.gz
Recombination file:		 
Family file:			 simped100.txt
Proband file:			 proband100.txt
Disease sites:			 1 
Odds ratios:			 2 
Disease mode:			 prevalence

Reading the reference file...
4 haplotypes with 1000 bp have been read

Calculating disease or QTL allele frequencies...
results exists. New Folder name: results_20200305_145247
Disease minor allele frequencies for the selected sites:
1 : 0	


Estimating the baseline penetrance...
The estimated disease prevalence is 0.109917
A total of 0 recombination rates >= 0.1 were read...
Launched 1 threads successfully.



Simulation of the pedigrees with one affected child

In [1]:
SeqSIMLA -popfile mat.bed.gz -famfile simped_one_child.txt -proband simped_one_child.txt -folder results -header sim_one_child -batch 1 -site 1  --mode-prev -prev 0.1 -or 2.0

bash: SeqSIMLA: command not found


: 127

In [17]:
rbinom(1000,1,0.1)

In [1]:
res = read.table('results_20200305_145734/Sim21.ped')

In [4]:
geno = res[,-(1:6)] -1 

In [7]:
apply(geno,2,mean)

In [8]:
head(res)

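| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<fct>` | `<fct>` | `<fct>` | `<fct>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` | `<int>` |
| 1 | FAM1 | M1 | 0 | 0 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | FAM1 | F1 | 0 | 0 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 2 |
| 3 | FAM1 | O2 | F1 | M1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 4 | FAM1 | O1 | F1 | M1 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 2 | 1 |
| 5 | FAM2 | M2 | 0 | 0 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| 6 | FAM2 | F2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |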


In [9]:
head(geno)

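| | V7 | V8 | V9 | V10 | V11 | V12 | V13 | V14 |
|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |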


In [12]:
sum(geno$V13) / nrow(geno)

In [13]:
getwd()

In [19]:
ped_file = 'results_20200305_145734/Sim21.ped'

In [23]:
lines = [x.strip().split() for x in open(ped_file).readlines()]

In [24]:
len(lines)

470

In [25]:
lines[:10]

[['FAM1', 'M1', '0', '0', '2', '1', '1', '1', '1', '1', '1', '1', '1', '1'],
 ['FAM1', 'F1', '0', '0', '1', '1', '1', '2', '1', '1', '1', '1', '1', '2'],
 ['FAM1', 'O2', 'F1', 'M1', '1', '2', '1', '1', '1', '1', '1', '1', '1', '1'],
 ['FAM1', 'O1', 'F1', 'M1', '2', '2', '2', '1', '1', '1', '1', '1', '2', '1'],
 ['FAM2', 'M2', '0', '0', '2', '2', '2', '2', '1', '1', '1', '1', '1', '1'],
 ['FAM2', 'F2', '0', '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1'],
 ['FAM2', 'O2', 'F2', 'M2', '1', '2', '1', '2', '1', '1', '1', '1', '1', '1'],
 ['FAM2', 'O1', 'F2', 'M2', '2', '2', '1', '2', '1', '1', '1', '1', '1', '1'],
 ['FAM3', 'M3', '0', '0', '2', '1', '1', '1', '1', '1', '1', '1', '1', '1'],
 ['FAM3', 'F3', '0', '0', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1']]
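The R steps above (`res[,-(1:6)] - 1` followed by `apply(geno, 2, mean)`) can be mirrored on the parsed Python rows. The sketch below is an illustration under assumptions: the function names are hypothetical, alleles are assumed to be coded `1`/`2` as in the output above, and only the first four rows of `lines` are reproduced so that the example is self-contained.

```python
def allele_column_means(rows):
    """Mean of (allele - 1) for each allele column after the six pedigree columns."""
    n_cols = len(rows[0]) - 6
    totals = [0] * n_cols
    for row in rows:
        for j, allele in enumerate(row[6:]):
            totals[j] += int(allele) - 1  # recode 1/2 as 0/1
    return [t / len(rows) for t in totals]


def marker_frequencies(column_means):
    """Average the two allele columns of each marker into one frequency."""
    return [
        (column_means[i] + column_means[i + 1]) / 2
        for i in range(0, len(column_means), 2)
    ]


# First four rows of `lines`, copied from the output above.
rows = [
    ['FAM1', 'M1', '0', '0', '2', '1', '1', '1', '1', '1', '1', '1', '1', '1'],
    ['FAM1', 'F1', '0', '0', '1', '1', '1', '2', '1', '1', '1', '1', '1', '2'],
    ['FAM1', 'O2', 'F1', 'M1', '1', '2', '1', '1', '1', '1', '1', '1', '1', '1'],
    ['FAM1', 'O1', 'F1', 'M1', '2', '2', '2', '1', '1', '1', '1', '1', '2', '1'],
]
col_means = allele_column_means(rows)
print(col_means)                      # one value per allele column
print(marker_frequencies(col_means))  # one value per marker
```

Passing the full `lines` list instead of `rows` would give the frequencies over all 470 individuals; family members are related, so these are descriptive summaries rather than population estimates.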