# Fine-mapping with PolyFun

## Aim

The purpose of this notebook is to demonstrate a functionally-informed fine-mapping workflow using the PolyFun method.

## Methods Overview 

## Input 

1) GWAS summary statistics including the following variables: 

- variant_id - variant ID 
- P - p-value 
- CHR - chromosome number 
- BP - base pair position
- A1 - The effect allele (i.e., the sign of the effect size is with respect to A1)
- A2 - the second allele 
- MAF - minor allele frequency 
- BETA - effect size 
- SE - effect size standard error

2) SNP-identifier file or S-LDSC (stratified LD-score regression) LD-score and annotation file

   SNP-identifier file should include the following columns: 

- CHR - chromosome
- BP - base pair position (in hg19 coordinates)
- A1 - The effect allele 
- A2 - the second allele

3) LD-score weights file
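
Before running the workflow, it can help to confirm that the summary statistics carry the columns listed above. The following is a minimal sketch, not part of PolyFun: `missing_sumstats_columns` is a hypothetical helper, and the toy table holds made-up values.

```python
import pandas as pd

# Column names taken from the input list above; adjust if your file differs.
REQUIRED_COLUMNS = ["variant_id", "P", "CHR", "BP", "A1", "A2", "MAF", "BETA", "SE"]


def missing_sumstats_columns(df, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# Toy table with made-up values, for illustration only (MAF is left out on purpose).
toy = pd.DataFrame({
    "variant_id": ["rs1", "rs2"],
    "P": [0.01, 0.5],
    "CHR": [1, 1],
    "BP": [1000, 2000],
    "A1": ["A", "C"],
    "A2": ["G", "T"],
    "BETA": [0.1, -0.02],
    "SE": [0.04, 0.03],
})
missing = missing_sumstats_columns(toy)
print(missing)
```

An empty list means every expected column is present; otherwise the list names what still has to be added or renamed.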


## Output

A `.gz` file containing input summary statistics columns and additionally the following columns:

- PIP - posterior causal probability
- BETA_MEAN - posterior mean of causal effect size (in standardized genotype scale)
- BETA_SD - posterior standard deviation of causal effect size (in standardized genotype scale)
- CREDIBLE_SET - the index of the first (typically smallest) credible set that the SNP belongs to (0 means none).
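
As a rough illustration of how these output columns are typically used, the sketch below filters a toy table by `PIP` and by `CREDIBLE_SET`. The values are made up, and the 0.95 cut-off is simply the threshold used in the summary cells later in this notebook.

```python
import pandas as pd

# Toy fine-mapping output with made-up values, for illustration only.
toy = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "PIP": [0.99, 0.40, 0.55, 0.01],
    "CREDIBLE_SET": [1, 2, 2, 0],
})

# SNPs with strong individual evidence of causality.
high_pip = toy[toy["PIP"] >= 0.95]

# SNPs that belong to any credible set (0 means none).
in_cs = toy[toy["CREDIBLE_SET"] > 0]

# Number of SNPs in each credible set.
cs_sizes = in_cs.groupby("CREDIBLE_SET").size()
print(high_pip["SNP"].tolist(), cs_sizes.to_dict())
```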


## Workflow

### Step 1: Compute Prior Causal Probabilities

#### Method 1: Use precomputed prior causal probabilities

Use precomputed prior causal probabilities of 19 million imputed UK Biobank SNPs with MAF>0.1%, based on a meta-analysis of 15 UK Biobank traits. 

In [None]:
[prior_causal_prob]
parameter: sumstats = 
bash: container='/mnt/mfs/statgen/tl3030/SIF/polyfun_ninth.sif'
    mkdir -p /mnt/mfs/statgen/tl3030/AD_output
    python /mnt/mfs/statgen/tl3030/extract_snpvar.py \
        --sumstats AD_sumstats_Jansenetal_2019sept.txt.gz \
        --out /mnt/mfs/statgen/tl3030/AD_output/AD_snps_with_var.gz \
        --allow-missing
    zcat /mnt/mfs/statgen/tl3030/AD_output/AD_snps_with_var.gz | head

#### Method 2: Compute via L2-regularized extension of S-LDSC (preferred)

Compute via an L2-regularized extension of stratified LD-score regression (S-LDSC). Procedure for both methods is shown in this workflow. 

In [None]:
[munged_sumstats]
parameter: sumstats = 
bash: container='/mnt/mfs/statgen/tl3030/SIF/polyfun_ninth.sif'
    mkdir -p /mnt/mfs/statgen/tl3030/AD_SLDSC_output
    python /mnt/mfs/statgen/tl3030/munge_polyfun_sumstats.py \
      --sumstats AD_sumstats_Jansenetal_2019sept.txt.gz \
      --n 450734 \
      --out /mnt/mfs/statgen/tl3030/AD_SLDSC_output/sumstats_munged.parquet \
      --min-info 0 \
      --min-maf 0

### Step 2: Create functional annotations 

#### Method 1: Use existing functional annotation files 

Use functional annotations for ~19 million UK Biobank imputed SNPs with MAF>0.1%, based on the baseline-LF 2.2.UKB annotations.

Download (30G): https://data.broadinstitute.org/alkesgroup/LDSCORE/baselineLF_v2.2.UKB.polyfun.tar.gz

#### Method 2: Create annotations 

To create your own annotations, for each chromosome, the following files are needed: 

1) A `.gz` or `.parquet` Annotations file containing the following columns:

- CHR - chromosome number
- BP - base pair position
- SNP - dbSNP reference number 
- A1 - The effect allele 
- A2 - the second allele
- Arbitrary additional columns representing annotations 

2) A `.l2.M` white-space delimited file containing a single line with the sums of the columns of each annotation

3) (Optional) A `.l2.M_5_50` file, which has the same layout as the `.l2.M` file but counts only common SNPs (MAF between 5% and 50%)
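
The single line stored in the `.l2.M` file can be derived from the annotations table itself. The sketch below is one possible way to do it, assuming every column other than the five identifier columns is an annotation. `annotation_sums` and `format_m_line` are hypothetical helpers, and the `maf` series is made up because the annotations file described above does not carry allele frequencies.

```python
import pandas as pd

# Identifier columns listed above; every other column is treated as an annotation.
ID_COLUMNS = ["CHR", "BP", "SNP", "A1", "A2"]


def annotation_sums(annot, keep=None):
    """Sum each annotation column, optionally over a boolean row mask (hypothetical helper)."""
    values = annot.drop(columns=ID_COLUMNS)
    if keep is not None:
        values = values[keep]
    return values.sum(axis=0)


def format_m_line(sums):
    """Format the sums as a single white-space delimited line."""
    return " ".join(str(v) for v in sums.tolist())


# Toy annotations with made-up values, for illustration only.
annot = pd.DataFrame({
    "CHR": [1, 1, 1],
    "BP": [100, 200, 300],
    "SNP": ["rs1", "rs2", "rs3"],
    "A1": ["A", "C", "G"],
    "A2": ["G", "T", "A"],
    "base": [1, 1, 1],
    "coding": [0, 1, 1],
})
maf = pd.Series([0.01, 0.20, 0.45])  # assumed MAFs, in the same row order as annot

m_line = format_m_line(annotation_sums(annot))
common = (maf >= 0.05) & (maf <= 0.5)
m_5_50_line = format_m_line(annotation_sums(annot, keep=common))
print(m_line)
print(m_5_50_line)
```

Each printed line would be written to the corresponding per-chromosome file.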


### Step 3: Compute LD-scores for annotations 

#### Method 1: Compute with reference panel of sequenced individuals 

Reference panel should have at least 3000 sequenced individuals from target population.

In [None]:
[ld_score]
bash: container = ''
    mkdir -p output
    python compute_ldscores.py \
    --bfile example_data/reference.1 \
    --annot example_data/annotations.1.annot.parquet \
    --out output/ldscores_example.parquet

#### Method 2: Compute with pre-computed UK Biobank LD matrices 

Matrices download: https://data.broadinstitute.org/alkesgroup/UKBB_LD

In [None]:
[ld_score_uk]
bash: container = ''
    mkdir -p output
    python compute_ldscores_from_ld.py \
    --annot example_data/annotations.1.annot.parquet \
    --ukb \
    --out output/ldscores_example2.parquet

#### Method 3: Compute with own pre-computed LD matrices

Own pre-computed LD matrices should be in `.bcor` format. 

In [None]:
[ld_score_own]
bash: container = ''
    mkdir -p output
    python compute_ldscores_from_ld.py \
    --annot example_data/annotations.1.annot.parquet \
    --out output/ldscores_example3.parquet \
    --n 10000 \
    bcor_files/*.bcor

### Step 4: Run PolyFun with L2-regularized S-LDSC

If prior causal probabilities have not been computed, use `finemapper.py` instead of `polyfun.py` to perform non-functionally-informed fine-mapping. 

In [None]:
[L2_regu_SLDSC]
bash: container='/mnt/mfs/statgen/tl3030/SIF/polyfun_ninth.sif'
    python /mnt/mfs/statgen/tl3030/polyfun.py \
    --compute-h2-L2 \
    --no-partitions \
    --output-prefix /mnt/mfs/statgen/tl3030/AD_SLDSC_output/testrun \
    --sumstats /mnt/mfs/statgen/tl3030/AD_SLDSC_output/sumstats_munged.parquet \
    --ref-ld-chr /mnt/mfs/statgen/tl3030/baselineLF2.2.UKB/baselineLF2.2.UKB. \
    --w-ld-chr /mnt/mfs/statgen/tl3030/weights.UKB.l2.ldscore/weights.UKB. \
    --allow-missing

## Minimal Working Example

In [None]:
module load Singularity

In [None]:
cat /mnt/mfs/statgen/tl3030/GCST90012877_buildGRCh37.tsv.gz | zcat | head -2

In [None]:
[modification]

python: expand = "${ }"
    import pandas as pd
    import numpy as np


    # Read in data
    sumstat = pd.read_csv('/mnt/mfs/statgen/tl3030/GCST90012877_buildGRCh37.tsv', sep='\t', header=0)

    # Rename columns so that munge_polyfun_sumstats.py could recognize it
    sumstat.rename(columns={'SNP_ID': 'SNP', 'chromosome': 'CHR', 'base_pair_location': 'BP', 
                        'effect_allele': 'A1', 'other_allele': 'A2', 'effect_allele_frequency': 'MAF', 
                        'p_value': 'P', 'standard_error': 'SE'}, inplace=True, errors='raise')

    # Replace missing variant IDs with a `chr_bp_ref_alt` string
    fallback_id = (sumstat['CHR'].astype(str) + '_' + sumstat['BP'].astype(str) + '_'
                   + sumstat['A2'].astype(str) + '_' + sumstat['A1'].astype(str))
    sumstat['variant_id'] = sumstat['variant_id'].fillna(fallback_id)

    # Write out a gzipped, tab-delimited file (the .txt.gz path that munge_polyfun_sumstats.py reads in the next cell)
    sumstat.to_csv('/mnt/mfs/statgen/tl3030/GCST90012877_buildGRCh37_colrenamed.txt.gz', index=False, sep='\t')

In [None]:
[munged_sumstats]
bash: container='/mnt/mfs/statgen/tl3030/SIF/polyfun_ninth.sif'
    mkdir -p /mnt/mfs/statgen/tl3030/AD_2021_output
    python /mnt/mfs/statgen/tl3030/munge_polyfun_sumstats.py \
      --sumstats /mnt/mfs/statgen/tl3030/GCST90012877_buildGRCh37_colrenamed.txt.gz \
      --n 472868 \
      --out /mnt/mfs/statgen/tl3030/AD_2021_output/sumstats_munged.parquet \
      --min-info 0.6 \
      --min-maf 0.001

In [None]:
sos run /mnt/mfs/statgen/tl3030/Untitled1.ipynb munged_sumstats

In [None]:
import pandas as pd
munged_file = pd.read_parquet('/mnt/mfs/statgen/tl3030/AD_2021_output/sumstats_munged.parquet', engine='auto')
print(munged_file.head())

In [None]:
[L2_regu_SLDSC]
bash: container='/mnt/mfs/statgen/tl3030/SIF/polyfun_ninth.sif'
    python /mnt/mfs/statgen/tl3030/polyfun.py \
    --compute-h2-L2 \
    --no-partitions \
    --output-prefix /mnt/mfs/statgen/tl3030/AD_2021_output/testrun \
    --sumstats /mnt/mfs/statgen/tl3030/AD_2021_output/sumstats_munged.parquet \
    --ref-ld-chr /mnt/mfs/statgen/tl3030/baselineLF2.2.UKB/baselineLF2.2.UKB. \
    --w-ld-chr /mnt/mfs/statgen/tl3030/weights.UKB.l2.ldscore/weights.UKB. \
    --allow-missing

In [None]:
sos run /mnt/mfs/statgen/tl3030/Untitled1.ipynb L2_regu_SLDSC

### Summary of Fine-mapping Result With Functional Annotations

In [None]:
import os.path
# get the location of finemapping result files
file_with_annot_location = os.path.join('/mnt', 'mfs', 'statgen','tl3030','AD_2021_output','with_annot', 'finemap.*.gz')
print(file_with_annot_location)

In [None]:
import glob
# get a list of result file name
filenames_with_annot = glob.glob(file_with_annot_location)
print(len(filenames_with_annot))
print(filenames_with_annot)

In [None]:
significant_frames = []

for f in filenames_with_annot:
    # read the data
    outfile = pd.read_csv(f, delimiter = "\t")
    
    # keep SNPs with PIP >= 0.95
    significant_frames.append(outfile[outfile['PIP']>=0.95])

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
snp_with_annot = pd.concat(significant_frames)

print(snp_with_annot.head(5))

In [None]:
# define a function to check for duplicates
def checkIfDuplicates(snp):
    ''' Check if given list contains any duplicates '''
    return len(snp) != len(set(snp))

# check whether any SNP ID appears more than once
result = checkIfDuplicates(snp_with_annot['SNP'])
if result:
    print('Yes, list contains duplicates')
else:
    print('No duplicates found in list') 
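
The same check can also be expressed with pandas directly, which additionally reports which IDs are repeated. This is a small sketch on a toy table with made-up values, not a replacement for the cell above.

```python
import pandas as pd

# Toy table with made-up values, for illustration only.
toy = pd.DataFrame({"SNP": ["rs1", "rs2", "rs1"], "PIP": [0.99, 0.97, 0.99]})

# True if any SNP ID occurs more than once.
has_duplicates = bool(toy["SNP"].duplicated().any())

# keep=False marks every occurrence of a repeated ID, not only the later ones.
duplicated_ids = toy.loc[toy["SNP"].duplicated(keep=False), "SNP"].unique().tolist()
print(has_duplicates, duplicated_ids)
```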

In [None]:
# remove duplicated SNPs
snp_with_annot_uniq = snp_with_annot.drop_duplicates(subset='SNP', keep='first')
print(snp_with_annot_uniq.head(5))
print(snp_with_annot_uniq.shape)

In [None]:
credible_set_frames = []

for f in filenames_with_annot:
    # read the data
    outfile = pd.read_csv(f, delimiter = "\t")
    
    # keep SNPs that belong to a credible set
    credible_set_frames.append(outfile[outfile['CREDIBLE_SET']>0])

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
CS_with_annot = pd.concat(credible_set_frames)

print(CS_with_annot.head(5))
print(CS_with_annot.shape)

In [None]:
# remove duplicated SNPs
CS_with_annot_uniq = CS_with_annot.drop_duplicates(subset='SNP', keep='first')
print(CS_with_annot_uniq.head(5))
print(CS_with_annot_uniq.shape)

In [None]:
pd.options.mode.chained_assignment = None

In [None]:
# Read in the range file
region_range = pd.read_csv("/mnt/mfs/statgen/tl3030/range.csv").dropna()

# Work on a fresh copy with a clean index, and make sure the GENE column exists
CS_with_annot_uniq = CS_with_annot_uniq.reset_index(drop=True)
if 'GENE' not in CS_with_annot_uniq.columns:
    CS_with_annot_uniq['GENE'] = None

# Assign each SNP to the gene region it falls in (a later matching region overwrites an earlier one)
for _, row in region_range.iterrows():
    in_region = ((CS_with_annot_uniq['CHR'] == row['Chr'])
                 & (CS_with_annot_uniq['BP'] > row['start'])
                 & (CS_with_annot_uniq['BP'] < row['end']))
    CS_with_annot_uniq.loc[in_region, 'GENE'] = row['Gene Name']

print(CS_with_annot_uniq.head(5))
print(CS_with_annot_uniq.shape)

In [None]:
CS_with_annot_uniq.to_csv('/mnt/mfs/statgen/tl3030/AD_2021_output/variants_with_CS_2021sumstat_97genes_with_annot.txt', index=False, sep='\t', mode='w')

In [None]:
CS_with_annot_uniq.sort_values(by=['CHR']) # sort the file by chromosome

In [None]:
num_of_CS_with_annot = CS_with_annot_uniq.drop_duplicates(subset=['CREDIBLE_SET', 'GENE'], keep = 'last').reset_index(drop = True)
print(num_of_CS_with_annot.shape)
num_of_CS_with_annot.sort_values(by=['GENE'])

In [None]:
num_of_gene_with_annot = CS_with_annot_uniq.drop_duplicates(subset=['GENE'], keep = 'last').reset_index(drop = True)
print(num_of_gene_with_annot.shape)

In [None]:
check_frequency_with_annot = CS_with_annot_uniq.groupby(["CREDIBLE_SET", "GENE"]).size().reset_index(name="Time")
print(check_frequency_with_annot.head(5))
print(check_frequency_with_annot.shape)

print(check_frequency_with_annot['Time'].sum())

In [None]:
CS_with_1_variant_with_annot = check_frequency_with_annot[check_frequency_with_annot['Time'] == 1]
print(CS_with_1_variant_with_annot.shape)
print(CS_with_1_variant_with_annot)
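
The counts computed in the cells above can be gathered into one small function, so that the same summary can be produced for any table with `GENE` and `CREDIBLE_SET` columns. This is a sketch only: `summarize_credible_sets` is a hypothetical helper and the toy table holds made-up values.

```python
import pandas as pd


def summarize_credible_sets(df):
    """Count variants, credible sets, single-variant sets and genes (hypothetical helper)."""
    # One row per (gene, credible set) pair, holding the number of variants in it.
    sizes = df.groupby(["GENE", "CREDIBLE_SET"]).size()
    return {
        "n_variants": int(sizes.sum()),
        "n_credible_sets": int(len(sizes)),
        "n_single_variant_sets": int((sizes == 1).sum()),
        "n_genes": int(df["GENE"].nunique()),
    }


# Toy table with made-up values, for illustration only.
toy = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "GENE": ["G1", "G1", "G1", "G2"],
    "CREDIBLE_SET": [1, 1, 2, 1],
})
summary = summarize_credible_sets(toy)
print(summary)
```

Rows without a gene assignment are dropped by `groupby`, so `n_variants` counts only variants that were mapped to a region.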

In [None]:
gene_with_annot = CS_with_annot_uniq['GENE']
print(gene_with_annot.shape)

gene_list_with_annot = gene_with_annot.drop_duplicates()
print(gene_list_with_annot)
print(gene_list_with_annot.shape)

### Summary of Fine-mapping Result Without Functional Annotations

In [None]:
import os.path
# get the location of finemapping result files
file_without_annot_location = os.path.join('/mnt', 'mfs', 'statgen','tl3030','AD_2021_output','without_annot', 'finemap.*.gz')
print(file_without_annot_location)

import glob
# get a list of result file name
filenames_without_annot = glob.glob(file_without_annot_location)
print(len(filenames_without_annot))
print(filenames_without_annot)

In [None]:
significant_frames = []

for f in filenames_without_annot:
    # read the data
    outfile = pd.read_csv(f, delimiter = "\t")
    
    # keep SNPs with PIP >= 0.95
    significant_frames.append(outfile[outfile['PIP']>=0.95])

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
snp_without_annot = pd.concat(significant_frames)

print(snp_without_annot.head(5))

# remove duplicated SNPs
snp_without_annot_uniq = snp_without_annot.drop_duplicates(subset='SNP', keep='first')
print(snp_without_annot_uniq.head(5))
print(snp_without_annot_uniq.shape)

In [None]:
credible_set_frames = []

for f in filenames_without_annot:
    # read the data
    outfile = pd.read_csv(f, delimiter = "\t")
    
    # keep SNPs that belong to a credible set
    credible_set_frames.append(outfile[outfile['CREDIBLE_SET']>0])

# DataFrame.append was removed in pandas 2.0, so combine the pieces with pd.concat
CS_without_annot = pd.concat(credible_set_frames)

# remove duplicated SNPs
CS_without_annot_uniq = CS_without_annot.drop_duplicates(subset='SNP', keep='first')
print(CS_without_annot_uniq.head(5))
print(CS_without_annot_uniq.shape)

In [None]:
# Read in the range file
region_range = pd.read_csv("/mnt/mfs/statgen/tl3030/range.csv").dropna()

# Work on a fresh copy with a clean index, and make sure the GENE column exists
CS_without_annot_uniq = CS_without_annot_uniq.reset_index(drop=True)
if 'GENE' not in CS_without_annot_uniq.columns:
    CS_without_annot_uniq['GENE'] = None

# Assign each SNP to the gene region it falls in (a later matching region overwrites an earlier one)
for _, row in region_range.iterrows():
    in_region = ((CS_without_annot_uniq['CHR'] == row['Chr'])
                 & (CS_without_annot_uniq['BP'] > row['start'])
                 & (CS_without_annot_uniq['BP'] < row['end']))
    CS_without_annot_uniq.loc[in_region, 'GENE'] = row['Gene Name']

print(CS_without_annot_uniq.head(5))
print(CS_without_annot_uniq.shape)

In [None]:
num_of_CS_without_annot = CS_without_annot_uniq.drop_duplicates(subset=['CREDIBLE_SET', 'GENE'], keep = 'last').reset_index(drop = True)
print(num_of_CS_without_annot.shape)
num_of_CS_without_annot.sort_values(by=['GENE'])

In [None]:
num_of_gene_without_annot = CS_without_annot_uniq.drop_duplicates(subset=['GENE'], keep = 'last').reset_index(drop = True)
print(num_of_gene_without_annot.shape)

In [None]:
check_frequency_without_annot = CS_without_annot_uniq.groupby(["CREDIBLE_SET", "GENE"]).size().reset_index(name="Time")
print(check_frequency_without_annot.head(5))
print(check_frequency_without_annot.shape)

print(check_frequency_without_annot['Time'].sum())

CS_with_1_variant_without_annot = check_frequency_without_annot[check_frequency_without_annot['Time'] == 1]
print(CS_with_1_variant_without_annot.shape)
print(CS_with_1_variant_without_annot)

In [None]:
gene_without_annot = CS_without_annot_uniq['GENE']
print(gene_without_annot.shape)

gene_list_without_annot = gene_without_annot.drop_duplicates()
print(gene_list_without_annot)
print(gene_list_without_annot.shape)