# Proband Extraction

The goal of this Jupyter notebook is to extract the probands from the BALSAC population register who married in Saguenay–Lac-Saint-Jean between 1931 and 1960, roughly correspondly to the last generation with the most complete genealogy.

In [1]:
import csv
import json
import pickle
import geneakit as gen

The paths to the datasets are found in the *paths.json* file.

In [2]:
with open("../paths.json", 'r') as file:
    paths = json.load(file)

The following loads the BALSAC genealogy.

In [3]:
ped = gen.genealogy(paths['balsac_genealogy'])
ped

A pedigree with:
 - 6140246 individuals;
 - 10817946 parent-child relations;
 - 3047310 men;
 - 3092936 women;
 - 3502836 probands;
 - 19 generations.

The following lists all individuals in the genealogy.

In [4]:
inds = ped.keys()
len(inds)

6140246

The following creates dictionaries that are used to convert city and region codes from BALSAC into human-readable name strings.

In [None]:
city_code_to_string = {}
with open(paths['geography_definitions'], 'r') as file:
    reader = csv.reader(file)
    next(reader)  # Skip header row
    for row in reader:
        UrbMariage, UrbIdMariage, *_ = row
        city_code_to_string[int(UrbIdMariage)] = UrbMariage

region_code_to_string = {
    25705: "Abitibi",
    25706: "Bas-Saint-Laurent",
    25707: "Beauce",
    25708: "Bois-Francs",
    25709: "Charlevoix",
    25710: "Côte-de-Beaupré",
    25711: "Côte-du-Sud",
    25712: "Côte-Nord",
    25713: "Estrie",
    25714: "Gaspésie",
    25715: "Île-de-Montréal",
    25716: "Îles-de-la-Madeleine",
    25717: "Lanaudière",
    25718: "Laurentides",
    25719: "Mauricie",
    25720: "Outaouais",
    25721: "Agglomération de Québec",
    25722: "Région de Québec",
    25723: "Nord du Québec",
    25724: "Richelieu",
    25725: "Rive Nord-Ouest de Montréal",
    25726: "Rive Sud de Montréal",
    25727: "Saguenay–Lac-Saint-Jean",
    25728: "Témiscamingue",
    27118: "Côte-de-Beaupré",
    27119: "Portneuf",
    27120: "Lévis-Lotbinière"
}

The following generates dictionaries for identifying demographical information from an individual and their parents.

In [6]:
city_proband = {}
region_proband = {}
year_proband = {}
city_parent = {}
region_parent = {}

with open(paths['demography_information'], 'r', encoding='cp1252') as file:
    reader = csv.reader(file, delimiter=';')
    next(reader)  # Skip header row
    for row in reader:
        if len(row) == 0: break
        IndID, CaG, ERRQ, PereID, MereID, Sexe, PaysOrigine, DateNaissance, RegionNaissance, \
        DateDeces, RegionDeces, DateMariage, ConjointID, URBMariage, RegionMariage, \
        DateMariageParents, URBMariageParents, RegionMariageParents = row

        IndID = int(IndID) if IndID != 'NA' else 0
        URBMariage = int(URBMariage) if URBMariage != 'NA' else 0
        RegionMariage = int(RegionMariage) if RegionMariage != 'NA' else 0
        DateMariage = int(DateMariage) if DateMariage != 'NA' else 0
        URBMariageParents = int(URBMariageParents) if URBMariageParents != 'NA' else 0
        RegionMariageParents = int(RegionMariageParents) if RegionMariageParents != 'NA' else 0

        city_proband[IndID] = city_code_to_string.get(URBMariage, 0)
        region_proband[IndID] = region_code_to_string.get(RegionMariage, 0)
        year_proband[IndID] = DateMariage
        city_parent[IndID] = city_code_to_string.get(URBMariageParents, 0)
        region_parent[IndID] = region_code_to_string.get(RegionMariageParents, 0)

The next three booleans are used to identify individuals who married in Saguenay–Lac-Saint-Jean between 1931 and 1960.

In [7]:
in_slsj = [region_proband.get(ind, 'Unknown') == 'Saguenay–Lac-Saint-Jean' for ind in inds]
sum(in_slsj)

175747

In [8]:
from_1931 = [year_proband.get(ind, 0) >= 1935 for ind in inds]
sum(from_1931)

1924461

In [9]:
until_1960 = [year_proband.get(ind, 1961) <= 1960 for ind in inds]
sum(until_1960)

5899837

The three booleans are used for identifying the generation.

In [10]:
generation = [ind for ind, cond1, cond2, cond3
              in zip(inds, in_slsj, from_1931, until_1960)
              if cond1 and cond2 and cond3]
len(generation)

88848

A new genealogy is formed from the original genealogy in order to filter out the parents, so we only keep the probands.

In [11]:
iso_ped = gen.branching(ped, pro=generation)
iso_ped

A pedigree with:
 - 369427 individuals;
 - 693062 parent-child relations;
 - 182912 men;
 - 186515 women;
 - 80348 probands;
 - 18 generations.

The following identifies the probands.

In [12]:
pro = gen.pro(iso_ped)
len(pro)

80348

The list of probands is finally saved as a pickle.

In [13]:
with open(paths['wd'] + "results/pickles/balsac_probands.pkl", 'wb') as file:
    pickle.dump(pro, file)