Creating a list in Python with a list comprehension. To learn more, see list comprehensions.

In [347]:
demographic_summary = [(i+1,0) for i in range(6)]
demographic_summary

[(1, 0), (2, 0), (3, 0), (4, 0), (5, 0), (6, 0)]

Building a dictionary in Python from the list of pairs

In [348]:
demographic_summary = dict(demographic_summary)
demographic_summary

{1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0}
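
As a minimal sketch, the list and the dictionary above could also be built in a single step with a dict comprehension. The name `summary_sketch` is hypothetical and used only for illustration.

```python
# Build {1: 0, ..., 6: 0} directly; `summary_sketch` is an illustrative name.
summary_sketch = {k: 0 for k in range(1, 7)}
print(summary_sketch)
```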

Demographic information on the number of children per household in the US as of 2019, extracted from https://www.census.gov/data/tables/2019/demo/families/cps-2019.html, table C3 (Living arrangements of children under 18 years and marital status of parents; values taken from the Presence of siblings section). TODO: ask Gao where he took the values from; I don't see values for more than 5 offspring in the table.

In [349]:
demographic_summary[1] = 14788000/73524000
demographic_summary[2] = 28464000/73524000
demographic_summary[3] = 18288000/73524000
demographic_summary[4] = 7636000/73524000
demographic_summary[5] = 2697000/73524000
demographic_summary[6] = 1651000/73524000

In [350]:
demographic_summary

{1: 0.20113160328600185,
 2: 0.3871388934225559,
 3: 0.24873510690386813,
 4: 0.10385724389315054,
 5: 0.036681899787824386,
 6: 0.022455252706599205}
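
A quick sanity check, sketched under the assumption that the six counts above are meant to cover the whole total of 73524000: convert the counts to proportions in one pass and confirm that they sum to 1. The names `counts`, `total` and `proportions` are illustrative and do not appear elsewhere in this notebook.

```python
import math

# Counts copied from the cell above, keyed by number of children.
counts = {1: 14788000, 2: 28464000, 3: 18288000,
          4: 7636000, 5: 2697000, 6: 1651000}
total = sum(counts.values())

# Convert every count to a proportion of the total.
proportions = {k: v / total for k, v in counts.items()}

# The counts add up to the denominator used above, so the proportions sum to 1.
print(total)
print(math.isclose(sum(proportions.values()), 1.0))
```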

Now we renormalize the distribution so that we only sample families with at least 2 offspring: the probability of 1 offspring is set to 0 and the remaining probabilities are divided by 1 - P(1).

In [351]:
# probability of a single-offspring family, removed from the distribution below
p_single = demographic_summary[1]
for k in demographic_summary:
    if k != 1:
        demographic_summary[k] /= (1 - p_single)
demographic_summary[1] = 0

In [352]:
list(demographic_summary.values())

[0,
 0.4846090983383275,
 0.3113593026423318,
 0.1300054481067829,
 0.0459173249795696,
 0.028108825932988288]
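
The renormalization is a conditional distribution, P(k | k >= 2) = P(k) / (1 - P(1)). A minimal sketch of the same step as a reusable function follows; the function name `condition_on_at_least_two` and the toy probabilities are assumptions for illustration, not the census figures.

```python
import math

def condition_on_at_least_two(probs):
    """Return P(k | k >= 2): zero out k = 1 and rescale the rest by 1 - P(1)."""
    rest = 1.0 - probs[1]
    return {k: (0.0 if k == 1 else p / rest) for k, p in probs.items()}

# Toy values chosen for easy arithmetic (hypothetical).
toy_probs = {1: 0.2, 2: 0.4, 3: 0.4}
conditioned = condition_on_at_least_two(toy_probs)

# After conditioning, the remaining probabilities sum to 1 again.
print(conditioned)
print(math.isclose(sum(conditioned.values()), 1.0))
```

Returning a new dictionary instead of modifying the input in place means the cell can be rerun without dividing by 1 - P(1) a second time.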

Now we draw 200 pedigrees from this multinomial distribution. Change **n** to the number of pedigrees you would like to simulate; in this case, 200.

In [353]:
import numpy as np
n = 200
data = np.random.multinomial(n, list(demographic_summary.values()))
data

array([ 0, 92, 59, 33, 11,  5])

In [354]:
data = dict([(k, x) for k, x in zip(demographic_summary.keys(), data)])
data

{1: 0, 2: 92, 3: 59, 4: 33, 5: 11, 6: 5}
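
The draw above changes on every run. If reproducible counts are wanted, one option is a seeded NumPy `Generator`; this is a sketch, not the method used in this notebook. The seed value, the rounded probabilities and the names `pvals`, `rng`, `n_pedigrees` and `counts_by_size` are all assumptions for illustration.

```python
import numpy as np

# Probabilities for family sizes 1..6, rounded from the output above
# and rescaled so that they sum to exactly 1.
pvals = np.array([0.0, 0.4846, 0.3114, 0.1300, 0.0459, 0.0281])
pvals = pvals / pvals.sum()

rng = np.random.default_rng(seed=12345)  # arbitrary seed, chosen for illustration
n_pedigrees = 200
draws = rng.multinomial(n_pedigrees, pvals)

# Map each family size to the number of pedigrees drawn for it.
counts_by_size = dict(zip(range(1, 7), draws.tolist()))
print(counts_by_size)
```

Because the probability for size 1 is 0, that category always receives 0 pedigrees, and the counts always add up to `n_pedigrees`.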

Now we generate the pedigrees with the code below. The generated ped file is written through a file handle created with the `open()` function.

In [355]:
ped_file = '/Users/dmc2245/Documents/Cornejo_Diana/family-association/seqsimla/input/simped200.txt'
proband_file = '/Users/dmc2245/Documents/Cornejo_Diana/family-association/seqsimla/input/proband200.txt'
offspring_file = '/Users/dmc2245/Documents/Cornejo_Diana/family-association/seqsimla/input/offspring200.txt'

In [367]:
num_fam = 0
# the with-block closes the file once all lines are written
with open(ped_file, 'w') as pedigree:
    for fam_type in data:
        if fam_type == 1:
            # single-offspring families are skipped
            continue
        for i in range(data[fam_type]):
            num_fam += 1
            fam_id = f'FAM{num_fam}'
            fid = f'F{num_fam}'
            mid = f'M{num_fam}'
            # founders: father (sex 1) and mother (sex 2), affection status 0
            print(f"{fam_id} {fid} 0 0 1 0", file=pedigree)
            print(f"{fam_id} {mid} 0 0 2 0", file=pedigree)
            for j in range(fam_type):
                sid = f"O{j+1}"
                # draw sex as 1 or 2 with equal probability
                sex = np.random.binomial(1, 0.5) + 1
                print(f"{fam_id} {sid} {fid} {mid} {sex} 0", file=pedigree)

In [357]:
#offspring = open(offspring_file, 'w')
num_fam = 0
# the with-block closes the file once all lines are written
with open(proband_file, 'w') as probands:
    for fam_type in data:
        if fam_type == 1:
            # single-offspring families are skipped
            continue
        for i in range(data[fam_type]):
            num_fam += 1
            fam_id = f'FAM{num_fam}'
            fid = f'F{num_fam}'
            mid = f'M{num_fam}'
            # create proband file with unaffected parents (1) and affected children (2)
            print(f"{fam_id} {fid} 0 0 1 1", file=probands)
            print(f"{fam_id} {mid} 0 0 2 1", file=probands)
            for j in range(fam_type):
                sid = f"O{j+1}"
                # draw sex as 1 or 2 with equal probability
                sex = np.random.binomial(1, 0.5) + 1
                print(f"{fam_id} {sid} {fid} {mid} {sex} 2", file=probands)
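
The two cells above repeat the same loop and differ only in the affection codes they write. A possible refactoring, sketched here as an assumption rather than as part of the original workflow, is a single helper that takes the codes as parameters. The function `write_pedigree`, its parameters, and the use of `io.StringIO` in place of a real file are all hypothetical.

```python
import io
import numpy as np

def write_pedigree(handle, counts_by_size, rng, parent_status=0, child_status=0):
    """Write one line per individual: FamID IndID FID MID Sex Affection."""
    num_fam = 0
    for n_children, n_families in counts_by_size.items():
        for _ in range(n_families):
            num_fam += 1
            fam_id, fid, mid = f"FAM{num_fam}", f"F{num_fam}", f"M{num_fam}"
            # Founders: father (sex 1) and mother (sex 2) have no parents listed.
            handle.write(f"{fam_id} {fid} 0 0 1 {parent_status}\n")
            handle.write(f"{fam_id} {mid} 0 0 2 {parent_status}\n")
            for j in range(n_children):
                sex = int(rng.integers(1, 3))  # 1 or 2 with equal probability
                handle.write(f"{fam_id} O{j + 1} {fid} {mid} {sex} {child_status}\n")
    return num_fam

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
buffer = io.StringIO()          # stands in for a file opened with `with open(path, "w")`
n_written = write_pedigree(buffer, {2: 1, 3: 1}, rng)
lines = buffer.getvalue().splitlines()
print(n_written, len(lines))
```

Calling it with `parent_status=1, child_status=2` would reproduce the layout of the proband file, and passing a handle from `with open(ped_file, "w") as handle:` would write to disk instead of to memory.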

In [358]:
print(proband_file)

/Users/dmc2245/Documents/Cornejo_Diana/family-association/seqsimla/input/proband200.txt


In [359]:
print(ped_file)

/Users/dmc2245/Documents/Cornejo_Diana/family-association/seqsimla/input/simped200.txt


In [368]:
data1 = [x.strip().split() for x in open(ped_file).readlines()]
data1

[['FAM1', 'F1', '0', '0', '1', '0'],
 ['FAM1', 'M1', '0', '0', '2', '0'],
 ['FAM1', 'O1', 'F1', 'M1', '2', '0'],
 ['FAM1', 'O2', 'F1', 'M1', '2', '0'],
 ['FAM2', 'F2', '0', '0', '1', '0'],
 ['FAM2', 'M2', '0', '0', '2', '0'],
 ['FAM2', 'O1', 'F2', 'M2', '2', '0'],
 ['FAM2', 'O2', 'F2', 'M2', '1', '0'],
 ['FAM3', 'F3', '0', '0', '1', '0'],
 ['FAM3', 'M3', '0', '0', '2', '0'],
 ['FAM3', 'O1', 'F3', 'M3', '1', '0'],
 ['FAM3', 'O2', 'F3', 'M3', '2', '0'],
 ['FAM4', 'F4', '0', '0', '1', '0'],
 ['FAM4', 'M4', '0', '0', '2', '0'],
 ['FAM4', 'O1', 'F4', 'M4', '2', '0'],
 ['FAM4', 'O2', 'F4', 'M4', '1', '0'],
 ['FAM5', 'F5', '0', '0', '1', '0'],
 ['FAM5', 'M5', '0', '0', '2', '0'],
 ['FAM5', 'O1', 'F5', 'M5', '2', '0'],
 ['FAM5', 'O2', 'F5', 'M5', '1', '0'],
 ['FAM6', 'F6', '0', '0', '1', '0'],
 ['FAM6', 'M6', '0', '0', '2', '0'],
 ['FAM6', 'O1', 'F6', 'M6', '1', '0'],
 ['FAM6', 'O2', 'F6', 'M6', '1', '0'],
 ['FAM7', 'F7', '0', '0', '1', '0'],
 ['FAM7', 'M7', '0', '0', '2', '0'],
 ['FAM7', 'O1'

Using pandas to create a dataframe from a list of lists: https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/

In [369]:
import pandas as pd
df = pd.DataFrame(data1, columns = ['FamID', 'IndID', 'FID', 'MID','Sex', 'Affection'])
print(df)

      FamID IndID   FID   MID Sex Affection
0      FAM1    F1     0     0   1         0
1      FAM1    M1     0     0   2         0
2      FAM1    O1    F1    M1   2         0
3      FAM1    O2    F1    M1   2         0
4      FAM2    F2     0     0   1         0
5      FAM2    M2     0     0   2         0
6      FAM2    O1    F2    M2   2         0
7      FAM2    O2    F2    M2   1         0
8      FAM3    F3     0     0   1         0
9      FAM3    M3     0     0   2         0
10     FAM3    O1    F3    M3   1         0
11     FAM3    O2    F3    M3   2         0
12     FAM4    F4     0     0   1         0
13     FAM4    M4     0     0   2         0
14     FAM4    O1    F4    M4   2         0
15     FAM4    O2    F4    M4   1         0
16     FAM5    F5     0     0   1         0
17     FAM5    M5     0     0   2         0
18     FAM5    O1    F5    M5   2         0
19     FAM5    O2    F5    M5   1         0
20     FAM6    F6     0     0   1         0
21     FAM6    M6     0     0   

In [370]:
# How many rows does the dataset have?
df['FamID'].count()

978

In [371]:
# How many entries are there for each FamID?
df['FamID'].value_counts()

FAM198    8
FAM197    8
FAM199    8
FAM196    8
FAM200    8
FAM193    7
FAM190    7
FAM195    7
FAM188    7
FAM192    7
FAM187    7
FAM194    7
FAM185    7
FAM191    7
FAM186    7
FAM189    7
FAM164    6
FAM173    6
FAM170    6
FAM167    6
FAM172    6
FAM180    6
FAM159    6
FAM176    6
FAM163    6
FAM168    6
FAM166    6
FAM154    6
FAM177    6
FAM169    6
         ..
FAM72     4
FAM83     4
FAM51     4
FAM67     4
FAM88     4
FAM24     4
FAM82     4
FAM78     4
FAM59     4
FAM44     4
FAM79     4
FAM17     4
FAM40     4
FAM89     4
FAM85     4
FAM54     4
FAM6      4
FAM53     4
FAM8      4
FAM64     4
FAM2      4
FAM74     4
FAM76     4
FAM33     4
FAM25     4
FAM37     4
FAM13     4
FAM19     4
FAM57     4
FAM42     4
Name: FamID, Length: 200, dtype: int64
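
To compare the file against the multinomial draw, the per-family row counts can be turned into a count of families per number of offspring. The sketch below uses a small toy table in the same six-column layout rather than the simulated data; the names `rows`, `toy`, `offspring_per_family` and `families_by_size` are illustrative.

```python
import pandas as pd

# Toy pedigree rows in the same six-column layout (not the simulated data).
rows = [
    ["FAM1", "F1", "0", "0", "1", "0"],
    ["FAM1", "M1", "0", "0", "2", "0"],
    ["FAM1", "O1", "F1", "M1", "2", "0"],
    ["FAM1", "O2", "F1", "M1", "1", "0"],
    ["FAM2", "F2", "0", "0", "1", "0"],
    ["FAM2", "M2", "0", "0", "2", "0"],
    ["FAM2", "O1", "F2", "M2", "1", "0"],
    ["FAM2", "O2", "F2", "M2", "2", "0"],
    ["FAM2", "O3", "F2", "M2", "2", "0"],
]
toy = pd.DataFrame(rows, columns=["FamID", "IndID", "FID", "MID", "Sex", "Affection"])

# Offspring are the rows whose father ID is not "0" (founders have FID "0").
offspring_per_family = toy[toy["FID"] != "0"].groupby("FamID").size()

# Number of families for each offspring count, ordered by family size.
families_by_size = offspring_per_family.value_counts().sort_index()
print(families_by_size.to_dict())
```

Applied to `df`, the resulting dictionary should match the `data` dictionary from the multinomial draw, apart from the entry for size 1.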

In [372]:
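# Maximum number of entries per family?
# Note: df['FamID'].max() would return the lexicographically largest ID ('FAM99'), not a count
df['FamID'].value_counts().max()

8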