# Create hard sample filters

For analysis, we need a cohort of samples with minimal population structure, minimal relatedness, and free of a few rare sources of error. This notebook generates the list of samples to remove from the analysis in order to create such a cohort. We start by importing the required packages, initialising PySpark, reading parameters from the configuration file, initialising Hail, and loading the participant dataset.

In [1]:
from pathlib import Path
import subprocess

import dxdata
import dxpy
import hail as hl
import pyspark
import tomli

from utils import fields_for_id

In [2]:
# Initialise Spark

sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)

In [3]:
# Set configurations

with open("../config.toml", "rb") as f:
    conf = tomli.load(f)

RAW_REL_FILE = conf["SAMPLE_QC"]["UKB_REL_DAT_FILE"]
FINAL_FILTER_FILE = conf["SAMPLE_QC"]["SAMPLE_FILTER_FILE"]

MAX_KINSHIP = conf["SAMPLE_QC"]["MAX_KINSHIP"]

LOG_FILE = str(Path(conf["IMPORT"]["LOG_DIR"], "sample_filters.log").resolve())
TMP_DIR = Path(conf["EXPORT"]["TMP_DIR"])
DATA_DIR = Path(conf["SAMPLE_QC"]["DATA_DIR"])

In [4]:
# Initialise Hail

hl.init(sc=sc, default_reference="GRCh38", log=LOG_FILE)

pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 2.4.4
SparkUI available at http://ip-10-60-188-36.eu-west-2.compute.internal:8081
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.61-3c86d3ba497a
LOGGING: writing to /opt/notebooks/gogoGPCR/hail_logs/sample_filters.log


In [5]:
# Load participant dataset

dispensed_database_name = dxpy.find_one_data_object(
    classname="database", name="app*", folder="/", name_mode="glob", describe=True
)["describe"]["name"]
dispensed_dataset_id = dxpy.find_one_data_object(
    typename="Dataset", name="app*.dataset", folder="/", name_mode="glob"
)["id"]

dataset = dxdata.load_dataset(id=dispensed_dataset_id)
participant = dataset["participant"]

# Filtering non-Caucasians and rare errors
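We retrieve four UK Biobank fields: genetic ethnic grouping (22006), outliers for heterozygosity or missing rate (22027), sex chromosome aneuploidy (22019) and genetic kinship to other participants (22021, UKB defined). Samples flagged in 22027 or 22019 are removed, as are samples that 22021 marks as excluded from kinship inference or as having ten or more third-degree relatives. The filter on 22006 (non-Caucasians) is currently commented out in the code, on the assumption that regenie can handle population structure.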

In [6]:
# Find relevant field names

fields = ["22027", "22019", "22006", "22021"]
# fields_for_id yields the matching field objects for each field ID
fields_per_id = [fields_for_id(i, participant) for i in fields]
# Flatten to a list of column names, keeping the participant ID first
field_names = ["eid"] + [field.name for group in fields_per_id for field in group]

In [7]:
# Retrieve dataframe

df = participant.retrieve_fields(
    names=field_names, engine=dxdata.connect(), coding_values="replace"
)

In [8]:
df.show(5, truncate=False)

+-------+------+------+---------+--------------------------------+
|eid    |p22027|p22019|p22006   |p22021                          |
+-------+------+------+---------+--------------------------------+
|3888244|null  |null  |Caucasian|No kinship found                |
|1795659|null  |null  |Caucasian|No kinship found                |
|2084720|null  |null  |Caucasian|At least one relative identified|
|3742232|null  |null  |Caucasian|At least one relative identified|
|1094442|null  |null  |Caucasian|At least one relative identified|
+-------+------+------+---------+--------------------------------+
only showing top 5 rows



In [9]:
# Use hard filters

df = df.filter(
    # df.p22006.isNull() |  (disabled: regenie should be able to handle population structure)
    (~df.p22027.isNull())
    | (~df.p22019.isNull())
    | (df.p22021 == "Participant excluded from kinship inference process")
    | (df.p22021 == "Ten or more third-degree relatives identified")
)
filtered_samples_to_remove = hl.Table.from_spark(df.select("eid")).key_by("eid")
print(f"Samples to be filtered: {filtered_samples_to_remove.count()}")

Samples to be filtered: 1815
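
To make the hard-filter rule concrete, here is a minimal plain-Python restatement of the predicate used above. It is only a sketch: `should_remove` is a hypothetical helper, and the rows (including the eids and the `"Yes"` flag value) are made up for illustration rather than taken from UK Biobank.

```python
# Plain-Python restatement of the hard-filter predicate (illustrative only).
EXCLUDED_KINSHIP_VALUES = {
    "Participant excluded from kinship inference process",
    "Ten or more third-degree relatives identified",
}


def should_remove(row):
    """Return True if a participant row trips any of the hard filters."""
    return (
        row["p22027"] is not None  # heterozygosity / missing-rate outlier flag set
        or row["p22019"] is not None  # sex chromosome aneuploidy flag set
        or row["p22021"] in EXCLUDED_KINSHIP_VALUES
    )


# Made-up rows; eids and flag values are hypothetical.
rows = [
    {"eid": "A", "p22027": None, "p22019": None, "p22021": "No kinship found"},
    {"eid": "B", "p22027": "Yes", "p22019": None, "p22021": "No kinship found"},
    {
        "eid": "C",
        "p22027": None,
        "p22019": None,
        "p22021": "Ten or more third-degree relatives identified",
    },
    {
        "eid": "D",
        "p22027": None,
        "p22019": None,
        "p22021": "At least one relative identified",
    },
]

to_remove = [row["eid"] for row in rows if should_remove(row)]
print(to_remove)
```

Note that a participant with "At least one relative identified" is not removed at this stage; related samples are handled separately in the next section.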


# Filter related samples
UK Biobank provides a list of genetically related pairs of individuals, computed with KING, in the file 'ukb_rel.dat', which contains a kinship coefficient for each pair. Here, we first drop every pair that involves a sample already removed in the previous step, then keep only the pairs that are related more closely than 3rd degree (kinship > 0.088). We then use Hail to find a maximal independent set, that is, a set of individuals in which no two are related above this threshold, aiming to remove as few individuals as possible. The individuals removed in this way are finally combined with the previously filtered samples to give the final list of samples to remove from the analysis.
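
The 0.088 cut-off sits at the boundary KING uses between second- and third-degree relatives. As a rough sketch, the function below maps a kinship coefficient to a relatedness category using the KING inference cut-offs, which are powers of two. `relatedness_degree` is a hypothetical helper that is not part of this notebook, and the actual threshold applied below is whatever `MAX_KINSHIP` is set to in the configuration file.

```python
# KING inference cut-offs: 2**-1.5 (~0.354), 2**-2.5 (~0.177),
# 2**-3.5 (~0.0884) and 2**-4.5 (~0.0442).
def relatedness_degree(kinship):
    """Map a KING kinship coefficient to a coarse relatedness category."""
    if kinship > 2 ** -1.5:
        return "duplicate or monozygotic twin"
    if kinship > 2 ** -2.5:
        return "first degree"
    if kinship > 2 ** -3.5:
        return "second degree"
    if kinship > 2 ** -4.5:
        return "third degree"
    return "unrelated"


for value in (0.45, 0.25, 0.12, 0.06, 0.01):
    print(value, relatedness_degree(value))
```

Under these cut-offs, keeping pairs with kinship above roughly 0.088 retains second-degree and closer relationships and lets third-degree pairs through.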

In [10]:
# Import relatedness table, drop pairs involving already-filtered samples and keep those with kinship > 0.088

rel = hl.import_table(
    "file:" + "/mnt/project/" + RAW_REL_FILE,
    delimiter=" ",
    impute=True,
    types={"ID1": "str", "ID2": "str"},
)

rel = (
    rel.key_by("ID2")
    .anti_join(filtered_samples_to_remove)
    .key_by("ID1")
    .anti_join(filtered_samples_to_remove)
)

rel = rel.filter(rel.Kinship > MAX_KINSHIP, keep=True)

print(
    f"Related pairs not already in filter and with kinship above threshold: {rel.count()}"
)

2021-11-30 09:17:17 Hail: INFO: Reading table to impute column types
2021-11-30 09:17:20 Hail: INFO: Finished type imputation
  Loading field 'ID1' as type str (user-supplied type)
  Loading field 'ID2' as type str (user-supplied type)
  Loading field 'HetHet' as type float64 (imputed)
  Loading field 'IBS0' as type float64 (imputed)
  Loading field 'Kinship' as type float64 (imputed)
2021-11-30 09:17:20 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:20 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:23 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:23 Hail: INFO: Ordering unsorted dataset with network shuffle


Related pairs not already in filter and with kinship above threshold: 40096
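
The count above refers to rows of the relatedness table, so it counts pairs rather than individual samples. The two anti-joins plus the kinship filter amount to the following plain-Python logic. This is a sketch on made-up pairs: `filter_pairs` and the sample IDs are hypothetical, and 0.088 is assumed as the threshold.

```python
def filter_pairs(pairs, already_removed, max_kinship):
    """Keep pairs where neither sample is already removed and kinship exceeds the cut-off."""
    return [
        (id1, id2, kinship)
        for id1, id2, kinship in pairs
        if id1 not in already_removed
        and id2 not in already_removed
        and kinship > max_kinship
    ]


# Made-up (ID1, ID2, Kinship) rows.
pairs = [
    ("s1", "s2", 0.25),
    ("s2", "s3", 0.05),  # dropped: kinship below the threshold
    ("s4", "s5", 0.12),  # dropped: s4 was already removed by the hard filters
    ("s6", "s1", 0.30),
]
already_removed = {"s4"}

kept = filter_pairs(pairs, already_removed, max_kinship=0.088)
print(kept)
```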


In [11]:
# Find maximal independent set

related_samples_to_remove = (
    hl.maximal_independent_set(
        i=rel.ID1,
        j=rel.ID2,
        keep=False,
    )
    .rename({"node": "eid"})
    .key_by("eid")
)

print(
    f"Samples to remove to create independent set: {related_samples_to_remove.count()}"
)

2021-11-30 09:17:25 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:25 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:26 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:26 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:28 Hail: INFO: wrote table with 40096 rows in 1 partition to /tmp/T8jzHq8ke3YQas5IYM5l9s
    Total size: 582.47 KiB
    * Rows: 582.46 KiB
    * Globals: 11.00 B
    * Smallest partition: 40096 rows (582.46 KiB)
    * Largest partition:  40096 rows (582.46 KiB)
2021-11-30 09:17:30 Hail: INFO: Ordering unsorted dataset with network shuffle


Samples to remove to create independent set: 34708
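
For intuition about what `keep=False` returns, the sketch below builds an independent set with a simple greedy heuristic: repeatedly drop the sample with the most remaining relatives until no related pair is left. This is an illustration only; `nodes_to_remove` is a hypothetical helper, the pairs are made up, and it is an assumption that this heuristic resembles what Hail does internally.

```python
from collections import defaultdict


def nodes_to_remove(pairs):
    """Greedily drop the most-connected sample until no related pair remains."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)

    removed = set()
    while True:
        # Pick the sample with the most remaining relatives; ties go to the smallest ID.
        candidate = max(
            sorted(neighbours), key=lambda n: len(neighbours[n]), default=None
        )
        if candidate is None or not neighbours[candidate]:
            break  # no edges left, so the remaining samples are mutually unrelated
        for other in neighbours.pop(candidate):
            neighbours[other].discard(candidate)
        removed.add(candidate)
    return removed


# Made-up related pairs: s1 is related to three samples, s5 and s6 to each other.
pairs = [("s1", "s2"), ("s1", "s3"), ("s1", "s4"), ("s5", "s6")]
removed = nodes_to_remove(pairs)
print(sorted(removed))
```

Removing `s1` resolves three pairs at once, which is why dropping well-connected samples first tends to keep the removal list short.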


In [12]:
# Join the two sets of samples to remove

final = related_samples_to_remove.join(filtered_samples_to_remove, how="outer")
print(f"Final number of samples to remove: {final.count()}")

2021-11-30 09:17:30 Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
    'eid' -> 'eid_1'
2021-11-30 09:17:30 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:30 Hail: INFO: Ordering unsorted dataset with network shuffle


Final number of samples to remove: 36523
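
The final count equals the sum of the two earlier counts because the relatedness pairs were anti-joined against the hard-filter list, so the two removal lists cannot share a sample. A minimal sketch of the same reasoning with Python sets (the sample IDs are made up):

```python
# Made-up removal lists; by construction they share no samples.
related_to_remove = {"s1", "s5"}
hard_filtered = {"s4", "s9"}

# An outer join on the key behaves like a set union of the two ID lists.
final_ids = related_to_remove | hard_filtered

# With disjoint inputs, the union size is the sum of the input sizes.
assert related_to_remove.isdisjoint(hard_filtered)
assert len(final_ids) == len(related_to_remove) + len(hard_filtered)

# The same arithmetic with the counts reported in this notebook.
reported_total = 34708 + 1815
print(reported_total)
```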


In [13]:
# Export list

FILTER_PATH = str((TMP_DIR / FINAL_FILTER_FILE).resolve())
PROCESSED_DIR = str(DATA_DIR.parents[0].stem / Path(DATA_DIR.stem)) + "/"

final.export("file:" + FILTER_PATH)

2021-11-30 09:17:31 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:31 Hail: INFO: Ordering unsorted dataset with network shuffle
2021-11-30 09:17:32 Hail: INFO: merging 17 files totalling 285.3K...
2021-11-30 09:17:33 Hail: INFO: while writing:
    file:/opt/notebooks/gogoGPCR/tmp/samples_to_remove.tsv
  merge time: 134.934ms


In [14]:
# Upload to project

subprocess.run(
    ["dx", "upload", FILTER_PATH, "--path", PROCESSED_DIR], check=True, shell=False
)

CompletedProcess(args=['dx', 'upload', '/opt/notebooks/gogoGPCR/tmp/samples_to_remove.tsv', '--path', 'Data/filters/'], returncode=0)
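
The upload destination `Data/filters/` is built from the last two components of `DATA_DIR`. The sketch below reproduces that derivation with `pathlib`; the value `/mnt/project/Data/filters` is an assumed example, since the real value comes from the configuration file, and `project_relative_dir` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def project_relative_dir(data_dir):
    """Return '<parent name>/<dir name>/' built from the last two path components."""
    data_dir = PurePosixPath(data_dir)
    return str(PurePosixPath(data_dir.parent.stem) / data_dir.stem) + "/"


# Assumed example value for DATA_DIR.
processed_dir = project_relative_dir("/mnt/project/Data/filters")
print(processed_dir)

# Caveat: .stem drops anything after the last dot, so a dotted directory
# name is truncated; .name would keep it intact.
truncated = project_relative_dir("/mnt/project/Data/filters.v2")
print(truncated)
```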