# Generating Patient-Variant Mappings

##### Updated 02/20/2023
##### Lara Brown

#### Goal:

Generate a csv file detailing the most severe variant for each UK Biobank patient for a given gene.

#### Required inputs:

**Variant-patient mapping:** You will need a gene-level variant-patient mapping `.ssv` file with the following columns: `Chrom, Name (variant id), Pos, Ref, Alt, Carriers`. The list of carriers should be pipe-separated, though the methods used in this file can be modified to match other formats.
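
For orientation, here is a minimal sketch of what such a file might contain and how a pipe-separated carrier list expands to one row per carrier. The two rows, the `chr19` prefix, and the carrier IDs are made up for illustration; the space delimiter is assumed from how the file is read later in this notebook.

```python
import io
import pandas as pd

# Two made-up rows in the expected layout: Chrom Name Pos Ref Alt Carriers
sample_ssv = io.StringIO(
    "chr19 19-11100245-C-T 11100245 C T 1000001|1000002|\n"
    "chr19 19-11107402-C-T 11107402 C T 1000003\n"
)

cols = ["Chrom", "Name", "Pos", "Ref", "Alt", "Carriers"]
# dtype=str keeps carrier IDs as text even when a row lists a single carrier
mapping = pd.read_csv(sample_ssv, sep=" ", names=cols, dtype={"Carriers": str})

# Strip any leading/trailing pipe, split on "|", then emit one row per carrier
mapping["Carriers"] = mapping["Carriers"].str.strip("|").str.split("|")
long_mapping = mapping.explode("Carriers").rename(columns={"Carriers": "Carrier"})
long_mapping = long_mapping.reset_index(drop=True)
print(long_mapping[["Name", "Carrier"]])
```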

**Annotated variant file:** You will also need a csv file of VEP-annotated variants for the same gene. This file can be generated on the BWH genetics cluster by modifying and running the wrapper script at `/net/ukbb/500k_processing_lara/parse_vcf_parallel.sh`. This script yields a .csv file that can then be uploaded to the RAP and used for the methods in this notebook. The two main components of this script are:

1. Running VEP using the following command:

```
vep --cache --force_overwrite --offline --dir /net/data/vep --assembly GRCh38 --plugin dbNSFP,/net/ukbb/data/dbnsfp_content/dbNSFP4.0a.gz,ALL --plugin LoF,loftee_path:/net/data/vep/loftee-grch38,human_ancestor_fa:/net/data/vep/loftee-grch38/human_ancestor.fa.gz --vcf --max_af --af_gnomade --af_gnomadg --no_check_variants_order --canonical --no_stats -I {gene_input.vcf.gz} -o {output_vep.vcf}
```

2. Parsing the VEP output by running `/net/ukbb/500k_processing_lara/vep_parser_af_lof.py`. Note that this parser requires that the VEP annotations include (a) max allele frequencies, both the default from dbNSFP (`allele_frequency`), as well as `gnomADe_MAX_AF` and `gnomADg_MAX_AF`, and (b) LoF confidence scores, obtained from the `loftee` VEP plugin. 
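
Before uploading a parsed file, it may help to confirm that these annotation columns are present. The sketch below is an assumption-laden convenience check, not part of the parser: `missing_required_columns` is a hypothetical helper, and the column names are taken from how the notebook code reads the parsed file.

```python
import pandas as pd

# Columns the downstream notebook code reads from the parsed VEP output
REQUIRED_COLUMNS = ["allele_frequency", "gnomADe_MAX_AF", "gnomADg_MAX_AF", "LoF_confidence"]

def missing_required_columns(parsed_variants):
    """Return the required annotation columns absent from a parsed-variant DataFrame."""
    return [c for c in REQUIRED_COLUMNS if c not in parsed_variants.columns]

# Toy frame that lacks the LOFTEE confidence column
toy = pd.DataFrame({
    "allele_frequency": [0.01],
    "gnomADe_MAX_AF": [0.02],
    "gnomADg_MAX_AF": [0.015],
})
missing = missing_required_columns(toy)
print(missing)
```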

#### Generated outputs:

The methods in this notebook can be used to generate a gene-level csv file containing the following columns:

| Column name | Description | Data type |
| :- | :- | :-: |
| eid | ID of carrier in the UK Biobank | object |
| variant_id/most_severe_variant | ID of the variant described in the remainder of the row. Format Chrom-Pos-Ref-Alt | object |
| Synonymous | 1 if variant is synonymous; 0 otherwise | int64 |
| Missense | 1 if variant is missense; 0 otherwise | int64 |
| Deleterious | 1 if variant is one of: `stop_gained, start_lost, splice_acceptor_variant, splice_donor_variant, frameshift_variant`; 0 otherwise | int64 |
| coding_position | relative position of base pair in coding sequence | float64 |
| AA_name | for non-synonymous variants, lists the amino acid substitution as RefLocationAlt | object |

#### Non-interactive option:

**The code in this notebook is not actively maintained. Please use this option for the most up-to-date version!** The outputs in this notebook can (and should!!) also be generated by running the script stored on RAP at `patient_variant_extracts/scripts/var-pt-map.py`. To do so, you will need to create a Docker image based on Python 3.6 with pandas and NumPy installed. Then, run the following commands in a local terminal where the Python SDK/dx-toolkit has been configured:

```
cmd="python3 var-pt-map.py LDLR -v blend_ukb/resources/parsed_bc_genes -s -e VEST4_score,CADD,REVEL_score,ukb_af"
dx run swiss-army-knife -iin="file-GQVXx8jJqBj0Vg55zP03kXj5" -icmd="$cmd" -iimage_file="patient_variant_extracts/scripts/pandas_numpy.tar.gz" --destination "patient_variant_extracts/bc_genes" -y
```
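
To submit the same job for several genes, the command can be assembled in a loop. The sketch below only builds the argument lists and prints them; nothing is executed. The helper name `build_dx_command` and the gene list are illustrative, and the paths and file ID are copied from the command above.

```python
genes = ["LDLR", "TP53"]  # example gene symbols

def build_dx_command(gene):
    """Assemble the `dx run` argument list for one gene (nothing is executed here)."""
    cmd = (
        f"python3 var-pt-map.py {gene} -v blend_ukb/resources/parsed_bc_genes "
        "-s -e VEST4_score,CADD,REVEL_score,ukb_af"
    )
    return [
        "dx", "run", "swiss-army-knife",
        "-iin=file-GQVXx8jJqBj0Vg55zP03kXj5",
        f"-icmd={cmd}",
        "-iimage_file=patient_variant_extracts/scripts/pandas_numpy.tar.gz",
        "--destination", "patient_variant_extracts/bc_genes",
        "-y",
    ]

commands = [build_dx_command(g) for g in genes]
for argv in commands:
    print(" ".join(argv))  # for display only; not shell-quoted
```

On a machine where dx-toolkit is configured, each list could be passed to `subprocess.run(argv, check=True)`.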

The file ID specified as the Swiss Army Knife input corresponds to the aforementioned `var-pt-map.py` script. Replace LDLR with your gene of choice, or run in a for-loop over multiple genes. The Python script is set up to accept the following flags:

| Flag | Full name | Description | Data type |
| :- | :- | :- | :-: |
| gene | n/a | Required; symbol of gene to access | str |
| -v | --var_path | Name of directory in which the annotated variant file is stored. Assumes file name is {gene}_parsed.csv | str |
| -c | --car_path | Name of directory in which the variant-patient `.ssv` mapping is stored. Assumes file name is {gene}.ssv | str |
| -s | --most_severe | Includes only the most severe variant per patient when specified | store_true |
| -e | --extra_cols | Comma-delimited list of additional VEP columns to include in output. Use to include, e.g., computational scores like VEST4 or CADD in the output file. ukb_af can be included if it was retained at initial generation of var-pt SSVs. | comma-separated str (converted to list) |

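The script itself is not reproduced in this notebook. As a rough guide, an `argparse` setup consistent with the table might look like the following; this is a hypothetical reconstruction, and the help strings and the default for `--extra_cols` are assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags listed in the table above."""
    parser = argparse.ArgumentParser(description="Generate patient-variant mappings for one gene")
    parser.add_argument("gene", type=str, help="symbol of gene to access")
    parser.add_argument("-v", "--var_path", type=str, help="directory holding {gene}_parsed.csv")
    parser.add_argument("-c", "--car_path", type=str, help="directory holding {gene}.ssv")
    parser.add_argument("-s", "--most_severe", action="store_true",
                        help="keep only the most severe variant per patient")
    # Convert the comma-separated string into a list of column names
    parser.add_argument("-e", "--extra_cols", type=lambda s: s.split(","), default=[],
                        help="comma-separated list of extra VEP columns")
    return parser

args = build_parser().parse_args(["LDLR", "-v", "parsed_bc_genes", "-s", "-e", "VEST4_score,CADD"])
print(args)
```

Flags that are omitted (here `-c`) come back as `None`, which lets the script fall back to its default directory.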

In [1]:
import pandas as pd
import numpy as np
from pandas.api.types import CategoricalDtype
from random import shuffle

### Packaging in functions

In [2]:
def get_variants(gene, dir_path="selected_genes/parsed_selected_genes_filtered_variants_5nt_flank"):
    """retrieves annotated and parsed variant file"""
    
    path = f'/mnt/project/{dir_path}/{gene}_parsed.csv'
    
    return(pd.read_csv(path))
    

In [3]:
def get_carriers(gene, dir_path="selected_genes/selected_genes_var_patient_5nt_flank"):
    """retrieve variant-patient mapping file"""
    
    path = f'/mnt/project/{dir_path}/{gene}.ssv'
    carriers = pd.read_csv(path,
                       sep=" ",
                       names=["Chrom", "na", "Pos", "Ref", "Alt", "Carriers"],
                       usecols=lambda x: x != 'na')
    carriers.loc[:, "Carriers"] = carriers["Carriers"].str.strip("|")
    carriers.loc[:, "Carriers"]= carriers["Carriers"].str.split("|", expand = False)
    carriers = carriers.explode("Carriers").rename(columns={"Carriers": "Carrier"})
    carriers = carriers[carriers['Carrier'].apply(lambda x: not x.startswith('W'))] # remove withdrawn
    carriers["Chrom"] = carriers["Chrom"].str.extract(r'(\d+)').astype("int64")  # raw string avoids an invalid-escape warning
    
    return(carriers)

In [4]:
def get_joined(gene, **paths):
    """retrieves both variant and mapping files and joins them together by variant"""
    
    variants = get_variants(gene, paths["var_path"]) if "var_path" in paths else get_variants(gene)
    carriers = get_carriers(gene, paths["car_path"]) if "car_path" in paths else get_carriers(gene)
    ann_carriers = carriers.join(variants.set_index(["Chrom", "Pos", "Ref", "Alt"]), on=["Chrom", "Pos", "Ref", "Alt"])
    
    return(ann_carriers)

Note that `get_select_carriers` below uses the Deleterious category as defined in the VEP parser, which includes splice acceptor/donor, frameshift, stop gained, and start lost variants. The first version of this function instead filtered on an explicit list of consequences that excluded start lost variants, so the total number of records differs depending on which version is used. In addition, the Deleterious category contains variants annotated with multiple consequences, at least one of which is deleterious, whereas the list in the first version required an exact string match and therefore did not count any variant with multiple consequences as deleterious. In LDLR, for example, including these multiply-annotated variants adds 1,008 deleterious variant/carrier pairs.
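
The difference between the two matching rules can be sketched as follows. This is an illustration only, not the parser's code: both helper functions are hypothetical, the same term set is used in both so that only the matching rule differs, and the `&` separator is how VEP joins multiple consequence terms in its VCF output.

```python
DELETERIOUS_TERMS = {
    "stop_gained", "start_lost", "splice_acceptor_variant",
    "splice_donor_variant", "frameshift_variant",
}

def is_deleterious_exact(vep_consequence):
    """Exact string match: the whole annotation must be a single deleterious term."""
    return vep_consequence in DELETERIOUS_TERMS

def is_deleterious_any(vep_consequence):
    """Any-term match: at least one of the '&'-joined terms is deleterious."""
    return any(term in DELETERIOUS_TERMS for term in vep_consequence.split("&"))

single = "frameshift_variant"
multiple = "splice_donor_variant&intron_variant"

for label in (single, multiple):
    print(label, is_deleterious_exact(label), is_deleterious_any(label))
```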

In [5]:
def get_select_carriers(ann_carriers):
    """
    accepts a joined carrier-variant df.
    filters by consequence and creates a string to represent amino acid substitutions when present
    """
    ann_carriers = ann_carriers.loc[(ann_carriers["consequence"].isin(["Missense", "Synonymous"])) |
                                   ((ann_carriers["consequence"] == "Deleterious") & (ann_carriers["LoF_confidence"] == "HC"))].copy()
    # .copy() makes the filtered frame independent, which avoids SettingWithCopyWarning
    ann_carriers.loc[ann_carriers.consequence == "Deleterious", "VEST4_score"] = 1 # set vest4 to max for del variants

    if ann_carriers.loc[:, "AA_pos"].dtypes != "object":
        ann_carriers.loc[:,"AA_pos"] = ann_carriers["AA_pos"].astype('Int64').astype('str')
    ann_carriers.loc[ann_carriers.AA_alt == "Stop", "AA_alt"] = "*"
    ann_carriers.loc[ann_carriers.AA_ref == "Stop", "AA_ref"] = "*"
    ann_carriers.loc[:, "AA_name"] = ann_carriers["AA_ref"] + ann_carriers["AA_pos"] + ann_carriers["AA_alt"]
    ann_carriers.loc[ann_carriers.consequence == "Synonymous", "AA_name"] = np.nan  # np.NaN alias was removed in NumPy 2.0

    return(ann_carriers)
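
As a quick illustration of the `AA_name` construction, the same steps can be run on a toy frame. The three rows below are made up; only the column names and the `Stop` to `*` substitution follow the function above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "consequence": ["Missense", "Deleterious", "Synonymous"],
    "AA_ref": ["R", "Q", "L"],
    "AA_pos": [210.0, 33.0, 50.0],   # positions often load as floats
    "AA_alt": ["L", "Stop", "L"],
})

# Cast positions to nullable integers first so "210.0" becomes "210"
pos = toy["AA_pos"].astype("Int64").astype(str)
toy["AA_alt"] = toy["AA_alt"].replace("Stop", "*")
toy["AA_ref"] = toy["AA_ref"].replace("Stop", "*")
toy["AA_name"] = toy["AA_ref"] + pos + toy["AA_alt"]

# Synonymous variants carry no substitution name
toy.loc[toy["consequence"] == "Synonymous", "AA_name"] = np.nan
print(toy[["consequence", "AA_name"]])
```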


In [6]:
def prep_for_pivot(ann_carriers):
    """accepts a joined and filtered patient-variant df and fills in missing values to allow pandas pivot"""
    
    ann_carriers.loc[ann_carriers["CADD"].isnull(), "CADD"] = "dummy"
    ann_carriers.loc[ann_carriers["VEST4_score"].isnull(), "VEST4_score"] = "dummy"
    ann_carriers.loc[ann_carriers["coding_position"].isnull(), "coding_position"] = "dummy"
    ann_carriers.loc[ann_carriers["AA_name"].isnull(), "AA_name"] = "dummy"
    ann_carriers.loc[:, "val"] = 1

    return(ann_carriers)
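
The placeholders matter because `pivot_table` groups on its index columns, and pandas drops groups with a missing key by default. A toy sketch of the effect, using made-up carrier IDs:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Carrier": ["1001", "1002"],
    "AA_name": ["R210L", np.nan],   # synonymous variants have no AA_name
    "consequence": ["Missense", "Synonymous"],
    "val": [1, 1],
})

# With NaN left in an index column, the second carrier is silently dropped
lossy = toy.pivot_table(index=["Carrier", "AA_name"], columns="consequence",
                        values="val", fill_value=0)

# Filling with a placeholder first keeps both rows
filled = toy.fillna({"AA_name": "dummy"})
kept = filled.pivot_table(index=["Carrier", "AA_name"], columns="consequence",
                          values="val", fill_value=0)

print(len(lossy), len(kept))
```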

In [7]:
def get_most_severe(ann_carriers):
    """
    accepts a filtered patient-variant df.
    selects and retains only the most severe variant per patient. returns this df.
    """
    
    # order variant consequence by general severity (deleterious > missense > synonymous)
    consequence_cats = CategoricalDtype(categories=["Synonymous", "Missense", "Deleterious"], ordered=True)
    ann_carriers.loc[:, "consequence"] = ann_carriers["consequence"].astype(consequence_cats)
    
    # choose maximum AF for each variant, where possible
    ann_carriers.loc[:, "max_AF"] = ann_carriers.loc[:, ["allele_frequency", "gnomADe_MAX_AF", "gnomADg_MAX_AF"]].max(axis=1, skipna=True)
    
    # order deleterious variant confidence levels (high > low)
    del_confidence_cats = CategoricalDtype(categories=["LC", "HC"], ordered=True)
    ann_carriers.loc[:, "LoF_confidence"] = ann_carriers["LoF_confidence"].astype(del_confidence_cats)
 
    # create a random tie breaker index
    shuffled_index = list(range(0,len(ann_carriers)))
    shuffle(shuffled_index)
    ann_carriers.loc[:, "tie_breaker"] = shuffled_index
    
    # group by eid, sort by consequence > CADD > confidence (for del. vars) > AF > tiebreaker
    # select top variant for each group
    ann_carriers_grouped = ann_carriers.groupby("Carrier", group_keys=False).apply(pd.DataFrame.sort_values, ["consequence", "CADD", "LoF_confidence", "max_AF", "tie_breaker"], ascending=[False, False, False, True, True])
    ann_carriers_grouped = ann_carriers_grouped.drop_duplicates("Carrier", keep="first")
    
    # important!! return categorical variables to normal objects so they don't clog pivot memory
    ann_carriers_grouped.loc[:, "consequence"] = ann_carriers_grouped["consequence"].astype(object)
    ann_carriers_grouped.loc[:, "LoF_confidence"] = ann_carriers_grouped["LoF_confidence"].astype(object)
    
    return(ann_carriers_grouped)
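
The ranking idea can be illustrated on a toy table. This sketch is not the notebook's implementation: it sorts once over the whole frame rather than per group, omits the `LoF_confidence` and `tie_breaker` keys for brevity, and uses made-up carriers and variant names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

toy = pd.DataFrame({
    "Carrier": ["A", "A", "A", "B", "B"],
    "Name": ["v1", "v2", "v3", "v4", "v5"],
    "consequence": ["Synonymous", "Missense", "Missense", "Synonymous", "Synonymous"],
    "CADD": [None, 1.4, 3.9, None, None],
    "max_AF": [0.004, 0.007, 0.015, 0.031, 0.004],
})

# An ordered categorical lets sort_values rank consequences by severity
severity = CategoricalDtype(["Synonymous", "Missense", "Deleterious"], ordered=True)
toy["consequence"] = toy["consequence"].astype(severity)

# Most severe first, then highest CADD, then rarest allele
ranked = toy.sort_values(
    ["Carrier", "consequence", "CADD", "max_AF"],
    ascending=[True, False, False, True],
)
top = ranked.drop_duplicates("Carrier", keep="first").reset_index(drop=True)
print(top[["Carrier", "Name"]])
```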

In [8]:
def reshape_consequence(ann_carriers, rename_id=False):
    """
    accepts filtered and prepped patient-variant df
    returns a reshaped df that separates consequences into unique columns
    """
    
    ann_carriers = ann_carriers.pivot_table(index=["Carrier", "Name", "coding_position", "CADD", "VEST4_score", "AA_name"], values="val", columns="consequence", fill_value=0)
    ann_carriers = ann_carriers.reset_index().replace('dummy',np.nan)

    new_names={"Carrier": "eid", "Name": "variant_id"}
    
    # this variable is set to true whenever get_most_severe was run before this function
    if rename_id:
        new_names["Name"] = "most_severe_variant"

    selected_cols=["Carrier", "Name", "Synonymous", "Missense", "Deleterious", "coding_position", "AA_name"]
    
    ann_carriers = ann_carriers.reindex(columns=selected_cols)
    ann_carriers = ann_carriers.rename(columns=new_names)
    ann_carriers.columns.name = None
    
    return(ann_carriers)
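
The reshaping step amounts to turning the `consequence` column into indicator columns. A minimal sketch with made-up carriers and variant IDs, and with fewer index columns than the function above:

```python
import pandas as pd

toy = pd.DataFrame({
    "Carrier": ["1001", "1002", "1003"],
    "Name": ["17-1-G-A", "17-2-A-C", "17-3-C-T"],   # made-up variant IDs
    "consequence": ["Synonymous", "Missense", "Deleterious"],
    "val": [1, 1, 1],
})

# One column per consequence; absent combinations are filled with 0
wide = toy.pivot_table(index=["Carrier", "Name"], columns="consequence",
                       values="val", fill_value=0)
wide = wide.reset_index()
wide.columns.name = None   # drop the leftover "consequence" axis label
print(wide)
```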

In [9]:
def patient_var_mappings(gene, most_severe=False, **paths):
    """
    accepts gene name and some combination of var_path and car_path directories.
    returns a pandas df of patient-variant pairs; when most_severe is True,
    returns one row per patient detailing their most severe variant
    """
    annotated_carriers = get_joined(gene, **paths)
    annotated_carriers = get_select_carriers(annotated_carriers)
    annotated_carriers = prep_for_pivot(annotated_carriers)
    
    if most_severe:
        annotated_carriers = get_most_severe(annotated_carriers)
        
    return(reshape_consequence(annotated_carriers, rename_id=most_severe))

### Obtaining extracts!

In [None]:
genes = ["TP53", "BRCA1", "BRCA2", "LDLR"]
for g in genes:
    file_name=f'{g}.csv'
    patient_var_mappings(g, var_path="parsed_bc_genes").to_csv(file_name, index=False)

In [46]:
# adapt this code to test the sorting process for patients with multiple variants
# test = ldlr_carriers_grouped[["Carrier", "Name", "vep_consequence", "VEST4_score", "CADD", "LoF_confidence", "max_AF", "tie_breaker"]]
# count = ldlr_carriers.groupby("Carrier").size().where(lambda x : x>2).dropna()
# c = ldlr_carriers.loc[ldlr_carriers['LoF_confidence'].notna()]["Carrier"].unique()

# multiple_var_ids = count.index.values.tolist()
# test[test["Carrier"].isin(multiple_var_ids)].head(25)

| | Carrier | Name | vep_consequence | VEST4_score | CADD | LoF_confidence | max_AF | tie_breaker |
| :- | :- | :- | :- | :- | :- | :- | :- | :- |
| 243 | 1237033 | 19-11110681-G-A | missense_variant | 0.549 | 3.861107 | | 0.014703 | 9912 |
| 186 | 1237033 | 19-11107402-C-T | synonymous_variant | | | | 0.003629 | 9745 |
| 194 | 1237033 | 19-11107432-C-T | synonymous_variant | | | | 0.004122 | 7334 |
| 186 | 1367116 | 19-11107402-C-T | synonymous_variant | | | | 0.003629 | 11015 |
| 194 | 1367116 | 19-11107432-C-T | synonymous_variant | | | | 0.004122 | 11122 |
| 7 | 1367116 | 19-11100245-C-T | synonymous_variant | | | | 0.03115 | 1198 |
| 265 | 1451411 | 19-11110735-G-A | missense_variant | 0.187 | 1.39181 | | 0.006591 | 3412 |
| 175 | 1451411 | 19-11106660-A-T | missense_variant | 0.284 | 1.081236 | | | 8550 |
| 650 | 1451411 | 19-11120455-G-A | synonymous_variant | | | | 0.000131 | 1457 |
| 265 | 1623518 | 19-11110735-G-A | missense_variant | 0.187 | 1.39181 | | 0.006591 | 9171 |


In [13]:
patient_var_mappings("TP53", var_path="cancer_func_score/parsed_bc_genes", most_severe=True)

| | eid | most_severe_variant | Synonymous | Missense | Deleterious | coding_position | AA_name |
| :- | :- | :- | :- | :- | :- | :- | :- |
| 0 | 1001862 | 17-7669642-G-A | 1 | 0 | 0 | 1149.0 | |
| 1 | 1002783 | 17-7670613-A-C | 0 | 1 | 0 | 1096.0 | S234A |
| 2 | 1004632 | 17-7676159-A-T | 1 | 0 | 0 | 210.0 | |
| 3 | 1006957 | 17-7676589-C-T | 1 | 0 | 0 | 6.0 | |
| 4 | 1009560 | 17-7670684-C-A | 0 | 1 | 0 | 1025.0 | R210L |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3817 | 6022267 | 17-7676047-C-G | 0 | 1 | 0 | 322.0 | G108R |
| 3818 | 6022834 | 17-7676230-G-A | 0 | 1 | 0 | 139.0 | P47S |
| 3819 | 6023144 | 17-7673751-C-T | 0 | 1 | 0 | 869.0 | R290H |
| 3820 | 6023387 | 17-7676589-C-T | 1 | 0 | 0 | 6.0 | |
